# Welcome to Sentiment Classification. A personal project by Levi Feldman.

Sentiment Classification refers to figuring out if a group of text (i.e a movie review) is more positive then negative or more negative then positive, wether or not the author of said text meant to be positive or negative.

This can be very usefull when let's say someone comes out with a movie and millions of people watch it. The movie maker wants to know around how many people liked or disliked his movie but it would take him months to read through every review and see if they are positive or negative. So that's where deep learning comes in, we will try to create a network that can predict with at least around an 85% accuracy wether or not a given review or reviews are positive or negative. 

In this project I will be building a neural network from scratch to analzye movie reviews and classify them by sentiment. I will be training the network using 25000 labeled movie reviews from IMDB. 

This notebook is an example of the process of solving a problem with deep learning by: 
- Developing a predictive theory
- Verifying the theory
- Proccessing data for the network
- Building a network
- Reducing noise and analyzing inefficiencys. 




# Looking at the datasets

By looking at the data I will try to figure out the best way for a neural network to learn how to differentiate between positive reviews and negative reviews.

In [1]:
g = open('reviews.txt','r') # put reviews into a list
reviews = list(map(lambda x:x[:-1],g.readlines()))
g.close()

g = open('labels.txt','r') # put labels into a list
labels = list(map(lambda x:x[:-1].upper(),g.readlines()))
g.close()

### Clean data, make reviews lowercase ###
for i in range(len(reviews)):
    reviews[i] = reviews[i].lower()

print("There are {} reviews and {} labels".format(len(reviews), len(labels)))

There are 25000 reviews and 25000 labels


In [2]:
reviews[5]

'this film lacked something i couldn  t put my finger on at first charisma on the part of the leading actress . this inevitably translated to lack of chemistry when she shared the screen with her leading man . even the romantic scenes came across as being merely the actors at play . it could very well have been the director who miscalculated what he needed from the actors . i just don  t know .  br    br   but could it have been the screenplay  just exactly who was the chef in love with  he seemed more enamored of his culinary skills and restaurant  and ultimately of himself and his youthful exploits  than of anybody or anything else . he never convinced me he was in love with the princess .  br    br   i was disappointed in this movie . but  don  t forget it was nominated for an oscar  so judge for yourself .  '

In [3]:
labels[5]

'NEGATIVE'

In [4]:
def show_review_and_label(i):
    print(labels[i] + "\t:\t" + reviews[i][:80] + "...")

print("Labels \t : \t Reviews\n")
show_review_and_label(12816)
show_review_and_label(21934)
show_review_and_label(5297)
show_review_and_label(4998)
show_review_and_label(6267)
show_review_and_label(2137)

Labels 	 : 	 Reviews

POSITIVE	:	adrian pasdar is excellent is this film . he makes a fascinating woman .  ...
POSITIVE	:	excellent episode movie ala pulp fiction .  days   suicides . it doesnt get more...
NEGATIVE	:	if you haven  t seen this  it  s terrible . it is pure trash . i saw this about ...
POSITIVE	:	this schiffer guy is a real genius  the movie is of excellent quality and both e...
NEGATIVE	:	comment this movie is impossible . is terrible  very improbable  bad interpretat...
NEGATIVE	:	this movie is terrible but it has some good effects .  ...


# Predictive theory:

So naturally, looking at the data, I see that positive words like "excellent", and "fascinating" are mentioned more often in the positive reviews then in the negative reviews. At the same time negative words like "bad", and "terrible" are mentioned way more in the negative reviews then in the positive reviews.   

Next I will try to validate my theory by counting the amount of times words occur in positive reviews and in negative reviews.

In [5]:
import numpy as np
from collections import Counter

positive_word_count = Counter()
negative_word_count = Counter()
total_word_count = Counter()

In [6]:
### Counting the amount of times each word appears in positive reviews and in negative reviews.

for i in range(len(reviews)):
    if labels[i] == 'POSITIVE':
        for word in reviews[i].split(' '):
            positive_word_count[word] += 1
            total_word_count[word] += 1
    elif labels[i] == 'NEGATIVE':
        for word in reviews[i].split(' '):
            negative_word_count[word] += 1
            total_word_count[word] += 1
        

In [7]:
### Most common words in positive reviews ###
positive_word_count.most_common()

[('', 550468),
 ('the', 173324),
 ('.', 159654),
 ('and', 89722),
 ('a', 83688),
 ('of', 76855),
 ('to', 66746),
 ('is', 57245),
 ('in', 50215),
 ('br', 49235),
 ('it', 48025),
 ('i', 40743),
 ('that', 35630),
 ('this', 35080),
 ('s', 33815),
 ('as', 26308),
 ('with', 23247),
 ('for', 22416),
 ('was', 21917),
 ('film', 20937),
 ('but', 20822),
 ('movie', 19074),
 ('his', 17227),
 ('on', 17008),
 ('you', 16681),
 ('he', 16282),
 ('are', 14807),
 ('not', 14272),
 ('t', 13720),
 ('one', 13655),
 ('have', 12587),
 ('be', 12416),
 ('by', 11997),
 ('all', 11942),
 ('who', 11464),
 ('an', 11294),
 ('at', 11234),
 ('from', 10767),
 ('her', 10474),
 ('they', 9895),
 ('has', 9186),
 ('so', 9154),
 ('like', 9038),
 ('about', 8313),
 ('very', 8305),
 ('out', 8134),
 ('there', 8057),
 ('she', 7779),
 ('what', 7737),
 ('or', 7732),
 ('good', 7720),
 ('more', 7521),
 ('when', 7456),
 ('some', 7441),
 ('if', 7285),
 ('just', 7152),
 ('can', 7001),
 ('story', 6780),
 ('time', 6515),
 ('my', 6488),
 ('g

In [8]:
### Most common words in negative reviews ###
negative_word_count.most_common()

[('', 561462),
 ('.', 167538),
 ('the', 163389),
 ('a', 79321),
 ('and', 74385),
 ('of', 69009),
 ('to', 68974),
 ('br', 52637),
 ('is', 50083),
 ('it', 48327),
 ('i', 46880),
 ('in', 43753),
 ('this', 40920),
 ('that', 37615),
 ('s', 31546),
 ('was', 26291),
 ('movie', 24965),
 ('for', 21927),
 ('but', 21781),
 ('with', 20878),
 ('as', 20625),
 ('t', 20361),
 ('film', 19218),
 ('you', 17549),
 ('on', 17192),
 ('not', 16354),
 ('have', 15144),
 ('are', 14623),
 ('be', 14541),
 ('he', 13856),
 ('one', 13134),
 ('they', 13011),
 ('at', 12279),
 ('his', 12147),
 ('all', 12036),
 ('so', 11463),
 ('like', 11238),
 ('there', 10775),
 ('just', 10619),
 ('by', 10549),
 ('or', 10272),
 ('an', 10266),
 ('who', 9969),
 ('from', 9731),
 ('if', 9518),
 ('about', 9061),
 ('out', 8979),
 ('what', 8422),
 ('some', 8306),
 ('no', 8143),
 ('her', 7947),
 ('even', 7687),
 ('can', 7653),
 ('has', 7604),
 ('good', 7423),
 ('bad', 7401),
 ('would', 7036),
 ('up', 6970),
 ('only', 6781),
 ('more', 6730),
 ('

In [9]:
### Most common words in all ###
total_word_count.most_common()

[('', 1111930),
 ('the', 336713),
 ('.', 327192),
 ('and', 164107),
 ('a', 163009),
 ('of', 145864),
 ('to', 135720),
 ('is', 107328),
 ('br', 101872),
 ('it', 96352),
 ('in', 93968),
 ('i', 87623),
 ('this', 76000),
 ('that', 73245),
 ('s', 65361),
 ('was', 48208),
 ('as', 46933),
 ('for', 44343),
 ('with', 44125),
 ('movie', 44039),
 ('but', 42603),
 ('film', 40155),
 ('you', 34230),
 ('on', 34200),
 ('t', 34081),
 ('not', 30626),
 ('he', 30138),
 ('are', 29430),
 ('his', 29374),
 ('have', 27731),
 ('be', 26957),
 ('one', 26789),
 ('all', 23978),
 ('at', 23513),
 ('they', 22906),
 ('by', 22546),
 ('an', 21560),
 ('who', 21433),
 ('so', 20617),
 ('from', 20498),
 ('like', 20276),
 ('there', 18832),
 ('her', 18421),
 ('or', 18004),
 ('just', 17771),
 ('about', 17374),
 ('out', 17113),
 ('if', 16803),
 ('has', 16790),
 ('what', 16159),
 ('some', 15747),
 ('good', 15143),
 ('can', 14654),
 ('more', 14251),
 ('she', 14223),
 ('when', 14182),
 ('very', 14069),
 ('up', 13291),
 ('time', 127

There are a lot of neutral words like "the", and "or". It's hard to validate the theory. What I'll do is calculate the ratios of how much  more a word is used in positive reviews then in negative reviews to really see if words like "excellent", and "amazing" are used more in positive reviews then negative reviews.

In [10]:
PN_ratios = Counter()

### Just taking words that are appear at least 100 times ###
common_words = []
for word,cnt in total_word_count.items():
    if cnt > 200:
        common_words.append(word)


for word in common_words:
    if negative_word_count[word] != 0: ### Making sure not to divide by zero ###
        PN_ratios[word] = positive_word_count[word] / float(negative_word_count[word])
    else:
        PN_ratios[word] = positive_word_count[word]


In [11]:
print("Positive-to-negative ratio for 'the' = {}".format(PN_ratios["the"]))
print("Positive-to-negative ratio for 'and' = {}".format(PN_ratios["and"]))
print("Positive-to-negative ratio for 'amazing' = {}".format(PN_ratios["amazing"]))
print("Positive-to-negative ratio for 'excellent' = {}".format(PN_ratios["excellent"]))
print("Poitives-to-negative ratio for 'terrible' = {}".format(PN_ratios["terrible"]))
print("Positive-to-negative ratio for 'bad' = {}".format(PN_ratios["bad"]))

Positive-to-negative ratio for 'the' = 1.0608058070004713
Positive-to-negative ratio for 'and' = 1.2061840424816832
Positive-to-negative ratio for 'amazing' = 4.038167938931298
Positive-to-negative ratio for 'excellent' = 4.337628865979381
Poitives-to-negative ratio for 'terrible' = 0.17757009345794392
Positive-to-negative ratio for 'bad' = 0.25766788271855157


Looking at these ratios: 
- Words like "excellent", that would be expected to show up more often in positive reviews have a ratio from 1 closer to infinity
- Words like "bad", that would be expected to show up more often in negative reviews have a ratio closer to 0
- Neutral words like "the" and "and", have a ratio close to 1


So we are starting to see that our theory of how the network can learn is correct. The amount of times specific words appears in a review is a very good way of telling if the review is positive or negative.

However, the ratios we got are not the best to work with because when comparing, the absolute value of the ratios of the words that appear more positive is much greater then the absolute value of those that appear more negative. Also when working with absolute values, it is much easier to center them around 0. Therefore I will convert all of these ratios to logarithms so that two words that are equally as negative as the other is positive will be the same distance from 0.  


In [12]:
for i in PN_ratios:
    PN_ratios[i] = np.log(PN_ratios[i])

In [13]:
print("Positive-to-negative ratio for 'the' = {}".format(PN_ratios["the"]))
print("Positive-to-negative ratio for 'and' = {}".format(PN_ratios["and"]))
print("Positive-to-negative ratio for 'amazing' = {}".format(PN_ratios["amazing"]))
print("Positive-to-negative ratio for 'excellent' = {}".format(PN_ratios["excellent"]))
print("Poitives-to-negative ratio for 'terrible' = {}".format(PN_ratios["terrible"]))
print("Positive-to-negative ratio for 'bad' = {}".format(PN_ratios["bad"]))

Positive-to-negative ratio for 'the' = 0.05902881460535952
Positive-to-negative ratio for 'and' = 0.18746169236813243
Positive-to-negative ratio for 'amazing' = 1.3957911086571477
Positive-to-negative ratio for 'excellent' = 1.4673278545675326
Poitives-to-negative ratio for 'terrible' = -1.7283898552954657
Positive-to-negative ratio for 'bad' = -1.3560837995970472


In [14]:
PN_ratios.most_common()

[('victoria', 2.7500144002012421),
 ('captures', 2.0794415416798357),
 ('wonderfully', 2.0485643031153966),
 ('powell', 2.0175661379617482),
 ('refreshing', 1.891548939836426),
 ('delightful', 1.8262456452992242),
 ('beautifully', 1.7784436932522829),
 ('underrated', 1.747956846569662),
 ('superb', 1.7189076208420597),
 ('welles', 1.693024628542366),
 ('sinatra', 1.6643145226599347),
 ('touching', 1.6514021115331325),
 ('stewart', 1.6249021381316819),
 ('brilliantly', 1.6191467265610613),
 ('friendship', 1.588384503236268),
 ('magnificent', 1.5686159179138452),
 ('jackie', 1.5686159179138452),
 ('wonderful', 1.5680329974659779),
 ('finest', 1.5668782980153044),
 ('freedom', 1.5321047090309301),
 ('terrific', 1.515408962785824),
 ('nancy', 1.5121746070088933),
 ('fantastic', 1.5117638297004306),
 ('noir', 1.5069971068796089),
 ('outstanding', 1.5040773967762742),
 ('marie', 1.5040773967762742),
 ('excellent', 1.4673278545675326),
 ('chan', 1.4484261422268967),
 ('gem', 1.407201045939204

In [15]:
list(reversed(PN_ratios.most_common()))

[('unfunny', -2.6882475738060303),
 ('waste', -2.6186484579840514),
 ('pointless', -2.4531579514734201),
 ('redeeming', -2.3648889763302003),
 ('lousy', -2.3025850929940455),
 ('worst', -2.2865847516476046),
 ('laughable', -2.2617630984737906),
 ('awful', -2.2265521924307397),
 ('poorly', -2.2192034840549946),
 ('sucks', -1.9830278120118159),
 ('lame', -1.9802348915963879),
 ('insult', -1.9730120788331045),
 ('horrible', -1.9093035277056216),
 ('amateurish', -1.9042374526547454),
 ('pathetic', -1.8979393212692839),
 ('wasted', -1.836211231798889),
 ('crap', -1.8270104186988547),
 ('tedious', -1.7971214123694403),
 ('dreadful', -1.7676619176489945),
 ('badly', -1.751858252475869),
 ('worse', -1.7364709637732161),
 ('terrible', -1.7283898552954657),
 ('embarrassing', -1.6969253665572162),
 ('mess', -1.6882490928583902),
 ('garbage', -1.6843392206072181),
 ('pile', -1.6625477377480489),
 ('stupid', -1.65455834771457),
 ('vampires', -1.6143041020852733),
 ('dull', -1.5831973397815833),
 ('

Checking out these ratios, we see that if a review contains words like "wonderfully", "refreshing", "delightful", and "beautifully"(the greatest numbers in PN_ratios), then it is most likely a positive review. We also see that if a review contains words like "unfunny", "waste", "lousy", "awful", "sucks", and "pathetic", then it is most probably a negative review. 

# Processing a review into input and creating our Neural Network

We will encapsulate the network in a class.

We need to create a method to input the reviews into our network, so what we'll do is:
- Create a vocabulary containing all of the words in all of the reviews. 
- Assign the vocabulary an index
- Create an input layer of zeros the same length as the vocabulary 
- Input a review by adding the amount of times each word is used in that review to the input layer of zeros at the indice of each word contained in the review

So this network object should be able to:
  - Take as input the number of hidden nodes and the learning rate and build the network 
  - Take as input reviews and labels to train on
  - Test the accuracy of the network against any given reviews
  - Predict the sentiment classification of any given movie review

This simple network will be built from scratch without using any deep learning libraries(tensorflow, keras..) only numpy. 


In [3]:
import time 
import sys
import numpy as np
from collections import Counter

class SentimentNetwork:
    def __init__(self, train_reviews, train_labels, hidden1_nodes = 10, hidden2_nodes = 10):
        np.random.seed(1)
        
        ### creating a network vocabulary with all of the words from all of the training reviews ###
        network_vocab = set()
        for review in train_reviews:
            for word in review.split(' '):
                network_vocab.add(word)
                
        self.vocab_index = {} ### creating an index for network vocab ###
        for index,word in enumerate(network_vocab):
            self.vocab_index[word] = index
        
        ### initializing weights ###
        self.weights_i_1 = np.zeros((len(network_vocab), hidden1_nodes))
        self.weights_1_2 = np.random.normal(0.0, hidden1_nodes**-0.5, (hidden1_nodes, hidden2_nodes))
        self.weights_2_o = np.random.normal(0.0, hidden2_nodes**-0.5, (hidden2_nodes, 1))
        
        self.train_reviews = train_reviews
        self.train_labels = train_labels
        
    ### sigmoid function ###
    def sigmoid(self,x):
        
        return 1/(1.0+np.exp(-x))
        
    ### derivative of sigmoid function ###
    def sigmoid_output_derivative(self,output):
        
        return output*(1-output)
        


    def train(self, learning_rate = 0.1, epochs = 15):
            
        assert(len(self.train_reviews) == len(self.train_labels))
        
        start = time.time()
            
        correct_so_far = 0
        
        ### process labels from "POSITIVE" and "NEGATIVE" to 1 and 0 ###
        for i in range(len(self.train_labels)):
            if self.train_labels[i] == "POSITIVE":
                self.train_labels[i] = 1
            else:
                self.train_labels[i] = 0
                
        
        ### create input layer ###
        layer_i = np.zeros((1, len(self.vocab_index)))
        
        ### looping through epochs ###
        for i in range(epochs):
            print(" ")
            print("\t Epoch #{}:".format(i+1))
            print("")
            for item in range(len(self.train_reviews)):
            
                ### get current review ###
                creview = self.train_reviews[item]
                clabel = self.train_labels[item]    
                    
                ### update input layer for current review ###
                layer_i *= 0 
                for word in creview.split(" "):
                    if word in self.vocab_index:
                        layer_i[0, self.vocab_index[word]] += 1
                
                ### FORWARD PASS ###
                layer_1 = np.dot(layer_i, self.weights_i_1)
                layer_2 = np.dot(layer_1, self.weights_1_2)
                output = self.sigmoid(np.dot(layer_2, self.weights_2_o))
            
                ### BACKPROPOGATION ###
                output_error = clabel - output
            
                output_error_term = output_error * self.sigmoid_output_derivative(output) 
            
                hidden2_error_term = output_error_term.dot(self.weights_2_o.T) * 1 # just the derivative
            
                hidden1_error_term = hidden2_error_term.dot(self.weights_1_2.T) * 1 # just the derivative
            
                ### updating weights on every review ###
                self.weights_2_o += layer_2.T.dot(output_error_term) * learning_rate
                self.weights_1_2 += layer_1.T.dot(hidden2_error_term) * learning_rate
                self.weights_i_1 += layer_i.T.dot(hidden1_error_term) * learning_rate
                
                ### for training progress ###
                if abs(output_error) < 0.5:
                    correct_so_far += 1
                    
                elapsed_time = float(time.time() - start)
                progress = ((100.0 * (item+1+(i*len(self.train_reviews)))/(len(self.train_reviews)*epochs)))
                
                if progress % 5 == 0:
                    sys.stdout.write("\rProgress:" + str(progress)[:4] + "%" \
                                    + " | #Correct:" + str(correct_so_far) + "--> #Trained:" + str((item+1)+(i*len(self.train_reviews))) \
                                    + " | Training Accuracy:" + str(correct_so_far * 100.0 / (float(item+1)+(i*len(self.train_reviews))))[:4] + "%" \
                                    + " | Average Training speed per epoch: " + str(elapsed_time/(i+1))[:4] + " seconds")
                
    def run(self, review):
        
        ### create input layer ###
        layer_i = np.zeros((1, len(self.vocab_index))) 
        for word in review.split(" "):
            if word in self.vocab_index:    
                layer_i[0, self.vocab_index[word]] += 1
            
        ### FORWARD PASS ###
        layer_1 = np.dot(layer_i, self.weights_i_1)
        layer_2 = np.dot(layer_1, self.weights_1_2)
        output = self.sigmoid(np.dot(layer_2, self.weights_2_o))
        
        if output >= 0.5:
            return "POSITIVE"
        else:
            return "NEGATIVE"
                
    
    def test(self, testing_reviews, testing_labels):
        
        correct = 0

        start = time.time()
        
        ### "self.run" each testing review through the network and count how many it got correct ###
        for i in range(len(testing_reviews)):
            pred = self.run(testing_reviews[i])
            if (pred == testing_labels[i]):
                correct += 1
            
            
            ### print out accuracy and speed ###
            elapsed_time = float(time.time() - start)
            reviews_per_second = i / elapsed_time if elapsed_time > 0 else 0
            
            if i == (len(testing_reviews)-1):
                sys.stdout.write("\rProgress: " + str(100 * i/float(len(testing_reviews)-1))[:4] + "%" \
                                 + "\t#Correct: " + str(correct) + "\t#Tested:" + str(i+1))
        print("")
        print("Testing Accuracy: " + str(100*correct/len(testing_reviews)) + "%")

Since we have 25000 movie reviews to work with, the first 24000 will be used for training, and the last 1000 for testing 

In [4]:
network_v1 = SentimentNetwork(reviews[:-1000], labels[:-1000])

Next I'll run the training with a learning rate of 0.1 and 5 epochs:

In [5]:
network_v1.train(learning_rate = 0.1, epochs = 5)

 
	 Epoch #1:

Progress:20.0% | #Correct:15412--> #Trained:24000 | Training Accuracy:64.2% | Average Training speed per epoch: 110. seconds 
	 Epoch #2:

Progress:40.0% | #Correct:33705--> #Trained:48000 | Training Accuracy:70.2% | Average Training speed per epoch: 111. seconds 
	 Epoch #3:

Progress:60.0% | #Correct:52867--> #Trained:72000 | Training Accuracy:73.4% | Average Training speed per epoch: 111. seconds 
	 Epoch #4:

Progress:80.0% | #Correct:72440--> #Trained:96000 | Training Accuracy:75.4% | Average Training speed per epoch: 112. seconds 
	 Epoch #5:

Progress:100.% | #Correct:92324--> #Trained:120000 | Training Accuracy:76.9% | Average Training speed per epoch: 111. seconds

And Testing the network on the last 1000 reviews....

In [6]:
network_v1.test(reviews[-1000:], labels[-1000:])

Progress: 100.%	#Correct: 838	#Tested:1000
Testing Accuracy: 83.8%


# Network Efficiency

The network trained well but it is very slow. Maybe we can make the network train faster by reducing the size of the network vocabulary, we don't need uncommon words so much anyway.

I will rebuild and test the network with a common_cutoff varible that is the amount of times a word has to be found in all training reviews in order for the word to make it into the network vocabulary.

In [7]:
import time 
import sys
import numpy as np
from collections import Counter

class SentimentNetwork:
    def __init__(self, train_reviews, train_labels, hidden1_nodes = 10, hidden2_nodes = 10, common_cutoff = 50):
        np.random.seed(1)
        
        ### creating total word count for all training reviews ###
        word_count = Counter()
        for review in train_reviews:
            for word in review.split(' '):
                word_count[word] += 1
        network_vocab = set()                     ####       MORE EFFICIENT AND GENERAL         ####
        for word in word_count:                   ####   <--------------                        #### 
            if word_count[word] >= common_cutoff: #### selecting common words for network vocab ####
                network_vocab.add(word)
                
        self.vocab_index = {} ### creating an index for network vocab ###
        for index,word in enumerate(network_vocab):
            self.vocab_index[word] = index
        
        ### initializing weights ###
        self.weights_i_1 = np.zeros((len(network_vocab), hidden1_nodes))
        self.weights_1_2 = np.random.normal(0.0, hidden1_nodes**-0.5, (hidden1_nodes, hidden2_nodes))
        self.weights_2_o = np.random.normal(0.0, hidden2_nodes**-0.5, (hidden2_nodes, 1))
        
        self.train_reviews = train_reviews
        self.train_labels = train_labels
        
    ### sigmoid function ###
    def sigmoid(self,x):
        
        return 1/(1.0+np.exp(-x))
        
    ### derivative of sigmoid function ###
    def sigmoid_output_derivative(self,output):
        
        return output*(1-output)
        


    def train(self, learning_rate = 0.1, epochs = 15):
            
        assert(len(self.train_reviews) == len(self.train_labels))
        
        start = time.time()
            
        correct_so_far = 0
        
        ### process labels from "POSITIVE" and "NEGATIVE" to 1 and 0 ###
        for i in range(len(self.train_labels)):
            if self.train_labels[i] == "POSITIVE":
                self.train_labels[i] = 1
            else:
                self.train_labels[i] = 0
                
        
        ### create input layer ###
        layer_i = np.zeros((1, len(self.vocab_index)))
        
        ### looping through epochs ###
        for i in range(epochs):
            print(" ")
            print("Epoch #{}:".format(i+1))
            print("")
            for item in range(len(self.train_reviews)):
            
                ### get current review ###
                creview = self.train_reviews[item]
                clabel = self.train_labels[item]    
                    
                ### update input layer for current review ###
                layer_i *= 0 
                for word in creview.split(" "):                
                    if word in self.vocab_index:                 
                        layer_i[0, self.vocab_index[word]] += 1               
                
                ### FORWARD PASS ###
                layer_1 = np.dot(layer_i, self.weights_i_1)
                layer_2 = np.dot(layer_1, self.weights_1_2)
                output = self.sigmoid(np.dot(layer_2, self.weights_2_o))
            
                ### BACKPROPOGATION ###
                output_error = clabel - output
            
                output_error_term = output_error * self.sigmoid_output_derivative(output) 
            
                hidden2_error_term = output_error_term.dot(self.weights_2_o.T) * 1 # just the derivative
            
                hidden1_error_term = hidden2_error_term.dot(self.weights_1_2.T) * 1 # just the derivative
            
                ### updating weights on every review ###
                self.weights_2_o += layer_2.T.dot(output_error_term) * learning_rate
                self.weights_1_2 += layer_1.T.dot(hidden2_error_term) * learning_rate
                self.weights_i_1 += layer_i.T.dot(hidden1_error_term) * learning_rate
                
                ### for training progress ###
                if abs(output_error) < 0.5:
                    correct_so_far += 1
                    
                elapsed_time = float(time.time() - start)
                progress = ((100.0 * (item+1+(i*len(self.train_reviews)))/(len(self.train_reviews)*epochs)))
                
                if progress % 5 == 0:
                    sys.stdout.write("\rProgress:" + str(progress)[:4] + "%" \
                                    + " | #Correct:" + str(correct_so_far) + "--> #Trained:" + str((item+1)+(i*len(self.train_reviews))) \
                                    + " | Training Accuracy:" + str(correct_so_far * 100.0 / (float(item+1)+(i*len(self.train_reviews))))[:4] + "%" \
                                    + " | Average Training speed per epoch: " + str(elapsed_time/(i+1))[:4] + " seconds")

                
    def run(self, review):
        
        ### create input layer ###
        layer_i = np.zeros((1, len(self.vocab_index))) 
        for word in review.split(" "):
            if word in self.vocab_index:    
                layer_i[0, self.vocab_index[word]] += 1
            
        ### FORWARD PASS ###
        layer_1 = np.dot(layer_i, self.weights_i_1)
        layer_2 = np.dot(layer_1, self.weights_1_2)
        output = self.sigmoid(np.dot(layer_2, self.weights_2_o))
        
        if output >= 0.5:
            return "POSITIVE"
        else:
            return "NEGATIVE"
                
    
    def test(self, testing_reviews, testing_labels):
        
        correct = 0

        start = time.time()
        
        ### "self.run" each testing review through the network and count how many it got correct ###
        for i in range(len(testing_reviews)):
            pred = self.run(testing_reviews[i])
            if (pred == testing_labels[i]):
                correct += 1
            
            
            ### print out accuracy and speed ###
            elapsed_time = float(time.time() - start)
            reviews_per_second = i / elapsed_time if elapsed_time > 0 else 0
            
            if i == (len(testing_reviews)-1):
                sys.stdout.write("\rProgress: " + str(100 * i/float(len(testing_reviews)-1))[:4] + "%" \
                                 + "\t#Correct: " + str(correct) + "\t#Tested:" + str(i+1))
        print("")
        print("Testing Accuracy: " + str(100*correct/len(testing_reviews)) + "%")

In [8]:
network_v2 = SentimentNetwork(reviews[:-1000], labels[:-1000], common_cutoff = 50)
network_v2.train(learning_rate = 0.1, epochs = 5)

 
Epoch #1:

Progress:20.0% | #Correct:15256--> #Trained:24000 | Training Accuracy:63.5% | Average Training speed per epoch: 10.1 seconds 
Epoch #2:

Progress:40.0% | #Correct:33541--> #Trained:48000 | Training Accuracy:69.8% | Average Training speed per epoch: 10.0 seconds 
Epoch #3:

Progress:60.0% | #Correct:52604--> #Trained:72000 | Training Accuracy:73.0% | Average Training speed per epoch: 10.0 seconds 
Epoch #4:

Progress:80.0% | #Correct:72101--> #Trained:96000 | Training Accuracy:75.1% | Average Training speed per epoch: 10.0 seconds 
Epoch #5:

Progress:100.% | #Correct:91813--> #Trained:120000 | Training Accuracy:76.5% | Average Training speed per epoch: 10.4 seconds

In [9]:
network_v2.test(reviews[-1000:], labels[-1000:])

Progress: 100.%	#Correct: 770	#Tested:1000
Testing Accuracy: 77.0%


Wow much faster, the size of our networks vocabulary has dramatically decreased making the input and training process a lot quicker with basically the same results. 



Now that we can train fast let's try to optimize the learning rate.
With a learning rate of 0.1 and 5 epochs, we got a 77.7% Testing accuracy

Let's see if making the learning rate 0.01 will give better results. I will bump up the epochs a little to compensate for the smaller learning rate. 

In [15]:
network_v2 = SentimentNetwork(reviews[:-1000], labels[:-1000], common_cutoff = 50)
network_v2.train(learning_rate = 0.01, epochs = 7)

 
Epoch #1:

Progress:10.0% | #Correct:10992--> #Trained:16800 | Training Accuracy:65.4% | Average Training speed per epoch: 6.99 seconds 
Epoch #2:

Progress:25.0% | #Correct:30108--> #Trained:42000 | Training Accuracy:71.6% | Average Training speed per epoch: 8.71 seconds 
Epoch #3:

Progress:40.0% | #Correct:50272--> #Trained:67200 | Training Accuracy:74.8% | Average Training speed per epoch: 9.29 seconds 
Epoch #4:

Progress:55.0% | #Correct:71011--> #Trained:92400 | Training Accuracy:76.8% | Average Training speed per epoch: 9.58 seconds 
Epoch #5:

Progress:70.0% | #Correct:92030--> #Trained:117600 | Training Accuracy:78.2% | Average Training speed per epoch: 9.76 seconds 
Epoch #6:

Progress:85.0% | #Correct:113316--> #Trained:142800 | Training Accuracy:79.3% | Average Training speed per epoch: 9.87 seconds 
Epoch #7:

Progress:100.% | #Correct:134823--> #Trained:168000 | Training Accuracy:80.2% | Average Training speed per epoch: 9.98 seconds

In [16]:
network_v2.test(reviews[-1000:], labels[-1000:])

Progress: 100.%	#Correct: 847	#Tested:1000
Testing Accuracy: 84.7%


84%, an increase. Let's see if making it even smaller to 0.001 will make a give better results.

In [17]:
network_v2 = SentimentNetwork(reviews[:-1000], labels[:-1000], common_cutoff = 50)
network_v2.train(learning_rate = 0.001, epochs = 7)

 
Epoch #1:

Progress:10.0% | #Correct:11001--> #Trained:16800 | Training Accuracy:65.4% | Average Training speed per epoch: 7.67 seconds 
Epoch #2:

Progress:25.0% | #Correct:30079--> #Trained:42000 | Training Accuracy:71.6% | Average Training speed per epoch: 9.14 seconds 
Epoch #3:

Progress:40.0% | #Correct:50204--> #Trained:67200 | Training Accuracy:74.7% | Average Training speed per epoch: 9.62 seconds 
Epoch #4:

Progress:55.0% | #Correct:70842--> #Trained:92400 | Training Accuracy:76.6% | Average Training speed per epoch: 9.87 seconds 
Epoch #5:

Progress:70.0% | #Correct:91836--> #Trained:117600 | Training Accuracy:78.0% | Average Training speed per epoch: 10.0 seconds 
Epoch #6:

Progress:85.0% | #Correct:113072--> #Trained:142800 | Training Accuracy:79.1% | Average Training speed per epoch: 10.1 seconds 
Epoch #7:

Progress:100.% | #Correct:134526--> #Trained:168000 | Training Accuracy:80.0% | Average Training speed per epoch: 10.1 seconds

In [18]:
network_v2.test(reviews[-1000:], labels[-1000:])

Progress: 100.%	#Correct: 824	#Tested:1000
Testing Accuracy: 82.4%


With an accuracy of 84.3%, it looks like a learning rate of 0.01, and 7 Epochs is the best option for now.


 

# Reducing Neural Noise

So far we have an accuracy of 84.3% which is pretty good. Maybe we can make our network able to perform better by reducing the amount of noise that goes through the network. Neural Noise is when the network does extra calculations with data that does not nesasarily help solve the problem at hand. 

Let's take a look at how we input a review into our network. 

Right now:
- We create an input layer the same size as our networks vocabulary 
- We set an index for that input layer for each word in the vocab
- for/at each indice:
  ----> input the number of times that word is used in the review  

Let's look at an example input from a review:

In [19]:
### review #1 as input ###
example_counter = Counter()
for word in reviews[0].split(" "):
    example_counter[word] += 1

example_counter.most_common()

[('.', 27),
 ('', 18),
 ('the', 9),
 ('to', 6),
 ('high', 5),
 ('i', 5),
 ('bromwell', 4),
 ('is', 4),
 ('a', 4),
 ('teachers', 4),
 ('that', 4),
 ('of', 4),
 ('it', 2),
 ('at', 2),
 ('as', 2),
 ('school', 2),
 ('my', 2),
 ('in', 2),
 ('me', 2),
 ('students', 2),
 ('their', 2),
 ('student', 2),
 ('cartoon', 1),
 ('comedy', 1),
 ('ran', 1),
 ('same', 1),
 ('time', 1),
 ('some', 1),
 ('other', 1),
 ('programs', 1),
 ('about', 1),
 ('life', 1),
 ('such', 1),
 ('years', 1),
 ('teaching', 1),
 ('profession', 1),
 ('lead', 1),
 ('believe', 1),
 ('s', 1),
 ('satire', 1),
 ('much', 1),
 ('closer', 1),
 ('reality', 1),
 ('than', 1),
 ('scramble', 1),
 ('survive', 1),
 ('financially', 1),
 ('insightful', 1),
 ('who', 1),
 ('can', 1),
 ('see', 1),
 ('right', 1),
 ('through', 1),
 ('pathetic', 1),
 ('pomp', 1),
 ('pettiness', 1),
 ('whole', 1),
 ('situation', 1),
 ('all', 1),
 ('remind', 1),
 ('schools', 1),
 ('knew', 1),
 ('and', 1),
 ('when', 1),
 ('saw', 1),
 ('episode', 1),
 ('which', 1),
 ('r

In [20]:
### review #4 as input ###
example_counter = Counter()
for word in reviews[3].split(" "):
    example_counter[word] += 1

example_counter.most_common()

[('', 167),
 ('the', 53),
 ('of', 23),
 ('.', 23),
 ('a', 22),
 ('is', 14),
 ('to', 13),
 ('airport', 12),
 ('i', 12),
 ('as', 11),
 ('with', 11),
 ('s', 11),
 ('it', 10),
 ('this', 9),
 ('in', 8),
 ('br', 8),
 ('plane', 7),
 ('t', 7),
 ('or', 7),
 ('out', 6),
 ('while', 6),
 ('but', 5),
 ('for', 5),
 ('not', 5),
 ('air', 4),
 ('time', 4),
 ('disaster', 4),
 ('three', 4),
 ('have', 4),
 ('one', 4),
 ('just', 4),
 ('if', 4),
 ('are', 4),
 ('that', 4),
 ('little', 4),
 ('much', 4),
 ('there', 4),
 ('scenes', 4),
 ('his', 3),
 ('also', 3),
 ('on', 3),
 ('by', 3),
 ('two', 3),
 ('they', 3),
 ('an', 3),
 ('films', 3),
 ('so', 3),
 ('has', 3),
 ('could', 3),
 ('great', 3),
 ('even', 3),
 ('film', 3),
 ('other', 3),
 ('oscar', 3),
 ('winner', 3),
 ('new', 2),
 ('luxury', 2),
 ('up', 2),
 ('valuable', 2),
 ('stevens', 2),
 ('james', 2),
 ('stewart', 2),
 ('who', 2),
 ('flying', 2),
 ('opened', 2),
 ('takes', 2),
 ('off', 2),
 ('mid', 2),
 ('hi', 2),
 ('chambers', 2),
 ('oil', 2),
 ('rig', 2),


Looking at these example input's, I notice that neutral words like "the", "of", "with", "this", and "as", appear in each review often, putting high numbers for those inputs. Their numbers also vary with every review and being neutral words, don't contribute at all to prediciting the sentiment. This means our network will have to unnecessarily learn how to accomodate to this extra noise. 


The network will train faster and might even learn better if instead of inputting the number of times a word is used in a review, we only input a '1' or '0', wether or not the word occured in the review.


From now on we will:
- Create an input layer the same size as our networks vocabulary 
- Set an index for that input layer for each word in the vocab
- for/at each indice:
  ----> input a '1' if the word occurs in the review and a '0' if it doesn't
  
  
Let's recreate our SentimentNetwork class this time with the ^new input method above.

In [None]:
import time 
import sys
import numpy as np
from collections import Counter

class SentimentNetwork:
    def __init__(self, train_reviews, train_labels, hidden1_nodes = 10, hidden2_nodes = 10, common_cutoff = 50):
        np.random.seed(1)
        
        ### creating total word count for all training reviews ###
        word_count = Counter()
        for review in train_reviews:
            for word in review.split(' '):
                word_count[word] += 1
        network_vocab = set()                     ####       MORE EFFICIENT AND GENERAL         ####
        for word in word_count:                   ####   <--------------                        #### 
            if word_count[word] >= common_cutoff: #### selecting common words for network vocab ####
                network_vocab.add(word)
                
        self.vocab_index = {} ### creating an index for network vocab ###
        for index,word in enumerate(network_vocab):
            self.vocab_index[word] = index
        
        ### initializing weights ###
        self.weights_i_1 = np.zeros((len(network_vocab), hidden1_nodes))
        self.weights_1_2 = np.random.normal(0.0, hidden1_nodes**-0.5, (hidden1_nodes, hidden2_nodes))
        self.weights_2_o = np.random.normal(0.0, hidden2_nodes**-0.5, (hidden2_nodes, 1))
        
        self.train_reviews = train_reviews
        self.train_labels = train_labels
        
    ### sigmoid function ###
    def sigmoid(self,x):
        
        return 1/(1.0+np.exp(-x))
        
    ### derivative of sigmoid function ###
    def sigmoid_output_derivative(self,output):
        
        return output*(1-output)
        


    def train(self, learning_rate = 0.1, epochs = 15):
            
        assert(len(self.train_reviews) == len(self.train_labels))
        
        start = time.time()
            
        correct_so_far = 0
        
        ### process labels from "POSITIVE" and "NEGATIVE" to 1 and 0 ###
        for i in range(len(self.train_labels)):
            if self.train_labels[i] == "POSITIVE":
                self.train_labels[i] = 1
            else:
                self.train_labels[i] = 0
                
        
        ### create input layer ###
        layer_i = np.zeros((1, len(self.vocab_index)))
        
        ### looping through epochs ###
        for i in range(epochs):
            print(" ")
            print("Epoch #{}:".format(i+1))
            print("")
            for item in range(len(self.train_reviews)):
            
                ### get current review ###
                creview = self.train_reviews[item]
                clabel = self.train_labels[item]    
                    
                ### update input layer for current review ###
                layer_i *= 0 
                for word in creview.split(" "):                ####        REDUCING NOISE        ####
                    if word in self.vocab_index:               #### from: "layer_i[...] *+=* 1", ####  
                        layer_i[0, self.vocab_index[word]] = 1 #### to ---> "layer_i[...] *=* 1" ####                
                
                ### FORWARD PASS ###
                layer_1 = np.dot(layer_i, self.weights_i_1)
                layer_2 = np.dot(layer_1, self.weights_1_2)
                output = self.sigmoid(np.dot(layer_2, self.weights_2_o))
            
                ### BACKPROPOGATION ###
                output_error = clabel - output
            
                output_error_term = output_error * self.sigmoid_output_derivative(output) 
            
                hidden2_error_term = output_error_term.dot(self.weights_2_o.T) * 1 # just the derivative
            
                hidden1_error_term = hidden2_error_term.dot(self.weights_1_2.T) * 1 # just the derivative
            
                ### updating weights on every review ###
                self.weights_2_o += layer_2.T.dot(output_error_term) * learning_rate
                self.weights_1_2 += layer_1.T.dot(hidden2_error_term) * learning_rate
                self.weights_i_1 += layer_i.T.dot(hidden1_error_term) * learning_rate
                
                ### for training progress ###
                if abs(output_error) < 0.5:
                    correct_so_far += 1
                    
                elapsed_time = float(time.time() - start)
                progress = ((100.0 * (item+1+(i*len(self.train_reviews)))/(len(self.train_reviews)*epochs)))
                
                if progress % 5 == 0:
                    sys.stdout.write("\rProgress:" + str(progress)[:4] + "%" \
                                    + " | #Correct:" + str(correct_so_far) + "--> #Trained:" + str((item+1)+(i*len(self.train_reviews))) \
                                    + " | Training Accuracy:" + str(correct_so_far * 100.0 / (float(item+1)+(i*len(self.train_reviews))))[:4] + "%" \
                                    + " | Average Training speed per epoch: " + str(elapsed_time/(i+1))[:4] + " seconds")

                
    def run(self, review):
        
        ### create input layer ###
        layer_i = np.zeros((1, len(self.vocab_index))) 
        for word in review.split(" "):                 ####        REDUCING NOISE        ####
            if word in self.vocab_index:               #### from: "layer_i[...] *+=* 1", ####
                layer_i[0, self.vocab_index[word]] = 1 #### to ---> "layer_i[...] *=* 1" ####
            
        ### FORWARD PASS ###
        layer_1 = np.dot(layer_i, self.weights_i_1)
        layer_2 = np.dot(layer_1, self.weights_1_2)
        output = self.sigmoid(np.dot(layer_2, self.weights_2_o))
        
        if output >= 0.5:
            return "POSITIVE"
        else:
            return "NEGATIVE"
                
    
    def test(self, testing_reviews, testing_labels):
        
        correct = 0

        start = time.time()
        
        ### "self.run" each testing review through the network and count how many it got correct ###
        for i in range(len(testing_reviews)):
            pred = self.run(testing_reviews[i])
            if (pred == testing_labels[i]):
                correct += 1
            
            
            ### print out accuracy and speed ###
            elapsed_time = float(time.time() - start)
            reviews_per_second = i / elapsed_time if elapsed_time > 0 else 0
            
            if i == (len(testing_reviews)-1):
                sys.stdout.write("\rProgress: " + str(100 * i/float(len(testing_reviews)-1))[:4] + "%" \
                                 + "\t#Correct: " + str(correct) + "\t#Tested:" + str(i+1))
        print("")
        print("Testing Accuracy: " + str(100*correct/len(testing_reviews)) + "%")

And test it...

Since the network is only dealing with numbers no greater than 1, we can reduce the number of Epochs when we train the network because the network does not have to learn to accomodate to large numbers of neutral words.  



In [22]:
network_v3 = SentimentNetwork(reviews[:-1000], labels[:-1000], common_cutoff = 50)
network_v3.train(learning_rate = 0.01, epochs = 3)
network_v3.test(reviews[-1000:], labels[-1000:])

 
Epoch #1:

Progress:30.0% | #Correct:18157--> #Trained:21600 | Training Accuracy:84.0% | Average Training speed per epoch: 11.4 seconds 
Epoch #2:

Progress:65.0% | #Correct:40607--> #Trained:46800 | Training Accuracy:86.7% | Average Training speed per epoch: 10.3 seconds 
Epoch #3:

Progress: 100.%	#Correct: 856	#Tested:1000d:72000 | Training Accuracy:88.3% | Average Training speed per epoch: 9.71 seconds
Testing Accuracy: 85.6%


This network has achieved an 85.6% Testing accuracy. with the added efficiency

# Further Network Efficiency

Let's take a look at the Forward Pass of the network training. 

1. The network is creating a input layer(layer_i), the same size as the network vocab
2. The network looks up the vocab index of each word in the review and adds 1 by that index in layer_i, every time that vocab word appears in the review
3. The network takes the dot product of layer_i with the weights_i_1 to get the value of layer_1
4. The network takes the dot product of layer_1 with the weights_1_2 to get the value of layer_2
5. Finally the network takes the dot product of layer_2 with the weights_2_o and gets a single final value


Maybe we can get rid of a couple steps.

What if we preprocessed all of the words in the training reviews to indices from the vocab_index. That way, to input a review, we can start off with a layer_1 of zeros and for every indice in the review, add all the weights for that indice (from weights_i_1) to layer_1 however many times that word appears in the review.

After replacing all of the words in the training reviews with indices from vocab_index before the training, our network can run more efficiently by changing the Forward Pass from the steps above^ to the steps below:

1. Create layer_1 the size of the amount of hidden nodes you want 
2. For every indice in a review, ----> add all of the weights from that indice in weights_i_1 to layer_1 however many times that word appears in the review
3. The network then takes the dot product of layer_1 with the weights_1_2 to get the value of layer_2
4. Finally the network takes the dot product of layer_2 with the weights_2_o and gets a single final value



Let's rebuild SentimentNetwork with the new Forward Pass 

In [23]:
import time 
import sys
import numpy as np
from collections import Counter

class SentimentNetwork:
    def __init__(self, train_reviews, train_labels, hidden1_nodes = 10, hidden2_nodes = 10, common_cutoff = 50):
        np.random.seed(1)
        
        ### creating total word count for all training reviews ###
        word_count = Counter()
        for review in train_reviews:
            for word in review.split(' '):
                word_count[word] += 1
        network_vocab = set() #### MORE EFFICIENT AND GENERAL- selecting common words for network vocab ####
        for word in word_count:
            if word_count[word] >= common_cutoff:
                network_vocab.add(word)
                
        self.vocab_index = {} ### creating an index for network vocab ###
        for index,word in enumerate(network_vocab):
            self.vocab_index[word] = index
        
        ### initializing weights ###
        self.weights_i_1 = np.zeros((len(network_vocab), hidden1_nodes))
        self.weights_1_2 = np.random.normal(0.0, hidden1_nodes**-0.5, (hidden1_nodes, hidden2_nodes))
        self.weights_2_o = np.random.normal(0.0, hidden2_nodes**-0.5, (hidden2_nodes, 1))
        
        self.train_reviews = train_reviews
        self.train_labels = train_labels
        
    ### sigmoid function ###
    def sigmoid(self,x):
        
        return 1/(1.0+np.exp(-x))
        
    ### derivative of sigmoid function ###
    def sigmoid_output_derivative(self,output):
        
        return output*(1-output)
        


    def train(self, learning_rate = 0.1, epochs = 15):
            
        assert(len(self.train_reviews) == len(self.train_labels))
        
        start = time.time()
            
        correct_so_far = 0
        
        ### processing labels from "POSITIVE" and "NEGATIVE" to 1 and 0 ###
        for i in range(len(self.train_labels)):
            if self.train_labels[i] == "POSITIVE":
                self.train_labels[i] = 1
            else:
                self.train_labels[i] = 0
        
        #### processing reviews from words into indices from vocab_index ### 
        indexed_reviews = list()                                         
        for review in self.train_reviews:                         ####          FURTHER EFFICIENCY            ####
            indexed_review = set()                                ####                                        #### 
            for word in review.split(" "):                        #### after converting reviews to:           #### 
                if word in self.vocab_index:                      #### <----  set(indices)                    ####            #### 
                    indexed_review.add(self.vocab_index[word])    ####                                        ####
            indexed_reviews.append(list(indexed_review))          ####                                        ####
                                                                  ####                                        ####
                                                                  ####                                        ####
        ### creating layer_1 ###                                  ####     <---- create layer_1               ####
        layer_1 = np.zeros((1, len(self.weights_i_1[0])))         ####                                        ####
                                                                  ####                                        #### 
        ### looping through epochs ###                            ####                                        ####
        for i in range(epochs):                                   ####                                        ####
            print(" ")                                            ####                                        ####
            print("Epoch #{}:".format(i+1))                       ####                                        ####
            print("")                                             ####                                        ####
            for item in range(len(self.train_reviews)):           ####                                        ####
                                                                  ####                                        ####
                ### get current review ###                        ####                                        ####
                creview = indexed_reviews[item]                   ####                                        ####
                clabel = self.train_labels[item]                  ####                                        ####
                                                                  ####                                        ####
                ### update *layer_1* for current review ###       #### For every indice in an indiced review: ####                
                layer_1 *= 0                                      ####                                        ####
                for indx in creview:                              #### take the weights at that indice from   ####                      
                    layer_1 += self.weights_i_1[indx]             #### weights_i_1, and add them to layer_1   ####
                
                ### FORWARD PASS ###
                layer_2 = np.dot(layer_1, self.weights_1_2)
                output = self.sigmoid(np.dot(layer_2, self.weights_2_o))
            
                ### BACKPROPOGATION ###
                output_error = clabel - output
            
                output_error_term = output_error * self.sigmoid_output_derivative(output) 
            
                hidden2_error_term = output_error_term.dot(self.weights_2_o.T) * 1 # just the derivative
            
                hidden1_error_term = hidden2_error_term.dot(self.weights_1_2.T) * 1 # just the derivative
                                                                  
                ### updating weights on every review ###
                self.weights_2_o += layer_2.T.dot(output_error_term) * learning_rate
                self.weights_1_2 += layer_1.T.dot(hidden2_error_term) * learning_rate
                                                                  ####         
                for indx in creview:                         ####    new way of updating weights for    ####
                    self.weights_i_1[indx] += hidden1_error_term[0] * learning_rate #### <---- weights_i_1  ####

                
                ### for training progress ###
                if abs(output_error) < 0.5:
                    correct_so_far += 1
                    
                elapsed_time = float(time.time() - start)
                progress = ((100.0 * (item+1+(i*len(self.train_reviews)))/(len(self.train_reviews)*epochs)))
                
                if progress % 5 == 0:
                    sys.stdout.write("\rProgress:" + str(progress)[:4] + "%" \
                                    + " | #Correct:" + str(correct_so_far) + "--> #Trained:" + str((item+1)+(i*len(self.train_reviews))) \
                                    + " | Training Accuracy:" + str(correct_so_far * 100.0 / (float(item+1)+(i*len(self.train_reviews))))[:4] + "%" \
                                    + " | Average Training speed per epoch: " + str(elapsed_time/(i+1))[:4] + " seconds")

                
    def run(self, review):
        
        indexed_review = set()                               ####  ....FURTHER EFFICIENCY  ####
        for word in review.split(" "):                    ####                          ####  
            if word in self.vocab_index:                  ####                          ####
                indexed_review.add(self.vocab_index[word])#                          #### 
                                                          ####                          ####
        ### create layer_1 ###                            ####                          ####
        layer_1 = np.zeros((1, len(self.weights_i_1[0]))) ####   <---------             #### 
        for indice in list(indexed_review):
            layer_1 += self.weights_i_1[indice]
            
        ### FORWARD PASS ###
        
        layer_2 = np.dot(layer_1, self.weights_1_2)
        output = self.sigmoid(np.dot(layer_2, self.weights_2_o))
        
        if output >= 0.5:
            return "POSITIVE"
        else:
            return "NEGATIVE"
                
    
    def test(self, testing_reviews, testing_labels):
        
        correct = 0

        start = time.time()
        
        ### "self.run" each testing review through the network and count how many it got correct ###
        for i in range(len(testing_reviews)):
            pred = self.run(testing_reviews[i])
            if (pred == testing_labels[i]):
                correct += 1
            
            
            ### print out accuracy and speed ###
            elapsed_time = float(time.time() - start)
           
            if i == (len(testing_reviews)-1):
                sys.stdout.write("\r\nProgress: " + str(100 * i/float(len(testing_reviews)-1))[:4] + "%" \
                                 + "\t#Correct: " + str(correct) + "\t#Tested:" + str(i+1))
        print("")
        print("Testing Accuracy: " + str(100*correct/len(testing_reviews)) + "%")

In [24]:
network_v3 = SentimentNetwork(reviews[:-1000], labels[:-1000])
network_v3.train(learning_rate = 0.01, epochs = 3)
network_v3.test(reviews[-1000:], labels[-1000:])

 
Epoch #1:

Progress:30.0% | #Correct:18157--> #Trained:21600 | Training Accuracy:84.0% | Average Training speed per epoch: 14.6 seconds 
Epoch #2:

Progress:65.0% | #Correct:40607--> #Trained:46800 | Training Accuracy:86.7% | Average Training speed per epoch: 12.9 seconds 
Epoch #3:

Progress:100.% | #Correct:63602--> #Trained:72000 | Training Accuracy:88.3% | Average Training speed per epoch: 12.3 seconds
Progress: 100.%	#Correct: 856	#Tested:1000
Testing Accuracy: 85.6%


# Further Noise Reduction

Let's try to see if we can optimize our network even more to solve our problem.

So let's go back to how we were first framing our problem.

The theory was that the network should learn which key words should matter when predicting the sentiment. We want the network to learn that words like "horrible", and "awful", should very much lean the predictions towards negative while words like "the" or "what" or "a", should not matter at all because they are neutral words. We want our network to recognize the sentiment of specifically positive and specifically negative words while learning to not give neutral words any value.


Maybe taking the neutral words out of our network vocabulary will allow the network to learn better because now it does not have to accomodate to as much noise?


So let's go back to our positive to negative word ratios to try and validate this theory.
Let's first take a look at the words with the most positive sentiments and words with the most negative sentiments and try to see at which P/N-ratio does the sentiment of the words start not mattering anymore. 

(Remember "PN_ratios" is a dictionary that contains a ratio for every word found in the training reviews of how much a  word appears in reviews labeled positive over how much that word appears in a review labeled negative)

In [19]:
PN_ratios.most_common()

[('victoria', 2.7500144002012421),
 ('captures', 2.0794415416798357),
 ('wonderfully', 2.0485643031153966),
 ('powell', 2.0175661379617482),
 ('refreshing', 1.891548939836426),
 ('delightful', 1.8262456452992242),
 ('beautifully', 1.7784436932522829),
 ('underrated', 1.747956846569662),
 ('superb', 1.7189076208420597),
 ('welles', 1.693024628542366),
 ('sinatra', 1.6643145226599347),
 ('touching', 1.6514021115331325),
 ('stewart', 1.6249021381316819),
 ('brilliantly', 1.6191467265610613),
 ('friendship', 1.588384503236268),
 ('magnificent', 1.5686159179138452),
 ('jackie', 1.5686159179138452),
 ('wonderful', 1.5680329974659779),
 ('finest', 1.5668782980153044),
 ('freedom', 1.5321047090309301),
 ('terrific', 1.515408962785824),
 ('nancy', 1.5121746070088933),
 ('fantastic', 1.5117638297004306),
 ('noir', 1.5069971068796089),
 ('outstanding', 1.5040773967762742),
 ('marie', 1.5040773967762742),
 ('excellent', 1.4673278545675326),
 ('chan', 1.4484261422268967),
 ('gem', 1.407201045939204

In [20]:
list(reversed(PN_ratios.most_common()))

[('unfunny', -2.6882475738060303),
 ('waste', -2.6186484579840514),
 ('pointless', -2.4531579514734201),
 ('redeeming', -2.3648889763302003),
 ('lousy', -2.3025850929940455),
 ('worst', -2.2865847516476046),
 ('laughable', -2.2617630984737906),
 ('awful', -2.2265521924307397),
 ('poorly', -2.2192034840549946),
 ('sucks', -1.9830278120118159),
 ('lame', -1.9802348915963879),
 ('insult', -1.9730120788331045),
 ('horrible', -1.9093035277056216),
 ('amateurish', -1.9042374526547454),
 ('pathetic', -1.8979393212692839),
 ('wasted', -1.836211231798889),
 ('crap', -1.8270104186988547),
 ('tedious', -1.7971214123694403),
 ('dreadful', -1.7676619176489945),
 ('badly', -1.751858252475869),
 ('worse', -1.7364709637732161),
 ('terrible', -1.7283898552954657),
 ('embarrassing', -1.6969253665572162),
 ('mess', -1.6882490928583902),
 ('garbage', -1.6843392206072181),
 ('pile', -1.6625477377480489),
 ('stupid', -1.65455834771457),
 ('vampires', -1.6143041020852733),
 ('dull', -1.5831973397815833),
 ('

Looking at the positive ratios, I see that at around a ratio of 0.35--->0, the words start getting more and more neutral. Same goes for the negative ratios, at around -0.35--->0, the words start getting more neutral.

The best way to really see where the neutral words are all located is by plotting these ratios out. 

In [21]:
from bokeh.models import ColumnDataSource, LabelSet
from bokeh.plotting import figure, show, output_file
from bokeh.io import output_notebook
output_notebook()

In [18]:
hist, edges = np.histogram(list(map(lambda x:x[1],PN_ratios.most_common())), density=True, bins=100, normed=False)

p = figure(tools="pan,wheel_zoom,reset,save",
           toolbar_location="above",
           title="Positive to Negative Affinity Distribution")
p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:], line_color="#555555")
show(p)

In the plot we can see that there is this huge chunk of neutral words that appear way more often then words with a stronger sentiment. 

Let's try to optimize our network by creating a new argument for our network, "polarity cutoff". From now on, a word will only get added to our vocabulary if the absolute value of its positive to negative ratio is greater then the value of polarity cutoff.

In [40]:
import time 
import sys
import numpy as np
from collections import Counter

class SentimentNetwork:
    def __init__(self, train_reviews, train_labels, hidden1_nodes = 10, hidden2_nodes = 10, common_cutoff = 50, polarity_cutoff = 0.5):
        np.random.seed(1)
        
        #### FURTHER NOISE REDUCTION ####
        ####   CREATING PN_ratios    ####
        P_count = Counter()
        N_count = Counter()
        
        total_word_count = Counter()
        for i in range(len(train_reviews)):
            for word in train_reviews[i].split(' '):
                if train_labels[i] == 'POSITIVE':       
                    P_count[word] += 1
                    total_word_count[word] += 1
                if train_labels[i] == 'NEGATIVE':
                    N_count[word] += 1
                    total_word_count[word] += 1
                                                              ####        : FURTHER NOISE REDUCTION :          ####
                                                              ####                                             #### 
        network_vocab = set()                                 ####  Only adding words to the networks vocab if ####
        for word in total_word_count:                         #### the absolute value of P/N ratio is greater  #### 
            ratio = float                                     ####   than "polarity_cutoff" AND, if the word   ####  
            if N_count[word] != 0:                            #### appears in all of the training reviews more ####    
                ratio = P_count[word] / float(N_count[word])  #### "than common_cutoff" times                  ####
            elif N_count[word] == 0:                                             
                ratio = P_count[word]
            
            #     # In order to add word to vocab it has to have a total count greater #      #      
            # than "common_cutoff", and a P/N ratio farther than "polarity_cutoff" from zero. #
            if total_word_count[word] >= common_cutoff and abs(ratio) >= polarity_cutoff: #### selecting common words for network vocab ####
                network_vocab.add(word)
                
        self.vocab_index = {} ### creating an index for network vocab ###
        for index,word in enumerate(network_vocab):
            self.vocab_index[word] = index
        
        ### initializing weights ###
        self.weights_i_1 = np.zeros((len(network_vocab), hidden1_nodes))
        self.weights_1_2 = np.random.normal(0.0, hidden1_nodes**-0.5, (hidden1_nodes, hidden2_nodes))
        self.weights_2_o = np.random.normal(0.0, hidden2_nodes**-0.5, (hidden2_nodes, 1))
        
        self.train_reviews = train_reviews
        self.train_labels = train_labels
        
    ### sigmoid function ###
    def sigmoid(self,x):
        
        return 1/(1.0+np.exp(-x))
        
    ### derivative of sigmoid function ###
    def sigmoid_output_derivative(self,output):
        
        return output*(1-output)
        


    def train(self, learning_rate = 0.1, epochs = 15):
            
        assert(len(self.train_reviews) == len(self.train_labels))
        
        start = time.time()
            
        correct_so_far = 0
        
        ### process labels from "POSITIVE" and "NEGATIVE" to 1 and 0 ###
        for i in range(len(self.train_labels)):
            if self.train_labels[i] == "POSITIVE":
                self.train_labels[i] = 1
            else:
                self.train_labels[i] = 0
                
        
        ### create input layer ###
        layer_i = np.zeros((1, len(self.vocab_index)))
        
        ### looping through epochs ###
        for i in range(epochs):
            print(" ")
            print("Epoch #{}:".format(i+1))
            print("")
            for item in range(len(self.train_reviews)):
            
                ### get current review ###
                creview = self.train_reviews[item]
                clabel = self.train_labels[item]    
                    
                ### update input layer for current review ###
                layer_i *= 0 
                for word in creview.split(" "):                ####        REDUCING NOISE        ####
                    if word in self.vocab_index:               #### from: "layer_i[...] *+=* 1", ####  
                        layer_i[0, self.vocab_index[word]] = 1 #### to ---> "layer_i[...] *=* 1" ####                
                
                ### FORWARD PASS ###
                layer_1 = np.dot(layer_i, self.weights_i_1)
                layer_2 = np.dot(layer_1, self.weights_1_2)
                output = self.sigmoid(np.dot(layer_2, self.weights_2_o))
            
                ### BACKPROPOGATION ###
                output_error = clabel - output
            
                output_error_term = output_error * self.sigmoid_output_derivative(output) 
            
                hidden2_error_term = output_error_term.dot(self.weights_2_o.T) * 1 # just the derivative
            
                hidden1_error_term = hidden2_error_term.dot(self.weights_1_2.T) * 1 # just the derivative
            
                ### updating weights on every review ###
                self.weights_2_o += layer_2.T.dot(output_error_term) * learning_rate
                self.weights_1_2 += layer_1.T.dot(hidden2_error_term) * learning_rate
                self.weights_i_1 += layer_i.T.dot(hidden1_error_term) * learning_rate
                
                ### for training progress ###
                if abs(output_error) < 0.5:
                    correct_so_far += 1
                    
                elapsed_time = float(time.time() - start)
                progress = ((100.0 * (item+1+(i*len(self.train_reviews)))/(len(self.train_reviews)*epochs)))
                
                if progress % 5 == 0:
                    sys.stdout.write("\rProgress:" + str(progress)[:4] + "%" \
                                    + " | #Correct:" + str(correct_so_far) + "--> #Trained:" + str((item+1)+(i*len(self.train_reviews))) \
                                    + " | Training Accuracy:" + str(correct_so_far * 100.0 / (float(item+1)+(i*len(self.train_reviews))))[:4] + "%" \
                                    + " | Average Training speed per epoch: " + str(elapsed_time/(i+1))[:4] + " seconds")

                
    def run(self, review):
        
        ### create input layer ###
        layer_i = np.zeros((1, len(self.vocab_index))) 
        for word in review.split(" "):                 ####        REDUCING NOISE        ####
            if word in self.vocab_index:               #### from: "layer_i[...] *+=* 1", ####
                layer_i[0, self.vocab_index[word]] = 1 #### to ---> "layer_i[...] *=* 1" ####
            
        ### FORWARD PASS ###
        layer_1 = np.dot(layer_i, self.weights_i_1)
        layer_2 = np.dot(layer_1, self.weights_1_2)
        output = self.sigmoid(np.dot(layer_2, self.weights_2_o))
        
        if output >= 0.5:
            return "POSITIVE"
        else:
            return "NEGATIVE"
                
    
    def test(self, testing_reviews, testing_labels):
        
        correct = 0

        start = time.time()
        
        ### "self.run" each testing review through the network and count how many it got correct ###
        for i in range(len(testing_reviews)):
            pred = self.run(testing_reviews[i])
            if (pred == testing_labels[i]):
                correct += 1
            
            
            ### print out accuracy and speed ###
            elapsed_time = float(time.time() - start)
            reviews_per_second = i / elapsed_time if elapsed_time > 0 else 0
            
            if i == (len(testing_reviews)-1):
                sys.stdout.write("\rProgress: " + str(100 * i/float(len(testing_reviews)-1))[:4] + "%" \
                                 + "\t#Correct: " + str(correct) + "\t#Tested:" + str(i+1))
        print("")
        print("Testing Accuracy: " + str(100*correct/len(testing_reviews)) + "%")

In [55]:
network_v4 = SentimentNetwork(reviews[:-1000], labels[:-1000], common_cutoff = 50, polarity_cutoff = 0.1)
network_v4.train(learning_rate = 0.001, epochs = 7)
network_v4.test(reviews[-1000:], labels[-1000:])

 
Epoch #1:

Progress:10.0% | #Correct:12823--> #Trained:16800 | Training Accuracy:76.3% | Average Training speed per epoch: 5.91 seconds 
Epoch #2:

Progress:25.0% | #Correct:33644--> #Trained:42000 | Training Accuracy:80.1% | Average Training speed per epoch: 7.23 seconds 
Epoch #3:

Progress:40.0% | #Correct:55413--> #Trained:67200 | Training Accuracy:82.4% | Average Training speed per epoch: 7.72 seconds 
Epoch #4:

Progress:55.0% | #Correct:77621--> #Trained:92400 | Training Accuracy:84.0% | Average Training speed per epoch: 7.92 seconds 
Epoch #5:

Progress:70.0% | #Correct:100148--> #Trained:117600 | Training Accuracy:85.1% | Average Training speed per epoch: 8.06 seconds 
Epoch #6:

Progress:85.0% | #Correct:122929--> #Trained:142800 | Training Accuracy:86.0% | Average Training speed per epoch: 8.13 seconds 
Epoch #7:

Progress: 100.%	#Correct: 854	#Tested:1000ed:168000 | Training Accuracy:86.8% | Average Training speed per epoch: 8.18 seconds
Testing Accuracy: 85.4%


In [56]:
network_v4.train(learning_rate = 0.001, epochs = 1)

 
Epoch #1:

Progress:100.% | #Correct:23930--> #Trained:24000 | Training Accuracy:99.7% | Average Training speed per epoch: 8.84 seconds

In [57]:
network_v4.test(reviews[-1000:], labels[-1000:])

Progress: 100.%	#Correct: 500	#Tested:1000
Testing Accuracy: 50.0%
