<a href="https://colab.research.google.com/github/infiniteoverflow/Sentiment-Analysis-using-Neural-Networks/blob/master/Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Curating the Dataset

In [0]:
def pretty_print_review_and_label(i):
    print(labels[i] + "\t:\t" + reviews[i][:80] + "...")

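# Read one review per line; the context manager closes the file for us
with open('/reviews.txt', 'r') as g:  # What we know!
    reviews = [line.rstrip('\n') for line in g]

# Read the matching labels and normalise them to upper case
with open('/labels.txt', 'r') as g:  # What we WANT to know!
    labels = [line.rstrip('\n').upper() for line in g]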

In [3]:
len(reviews)

25000

In [4]:
reviews[1]

'story of a man who has unnatural feelings for a pig . starts out with a opening scene that is a terrific example of absurd comedy . a formal orchestra audience is turned into an insane  violent mob by the crazy chantings of it  s singers . unfortunately it stays absurd the whole time with no general narrative eventually making it just too off putting . even those from the era should be turned off . the cryptic dialogue would make shakespeare seem easy to a third grader . on a technical level it  s better than you might think with some good cinematography by future great vilmos zsigmond . future stars sally kirkland and frederic forrest can be seen briefly .  '

In [5]:
labels[1]

'NEGATIVE'

# Exploring the Dataset

In [6]:
print('label\t\t: \t review\n')

pretty_print_review_and_label(2241)
pretty_print_review_and_label(1871)
pretty_print_review_and_label(6)
pretty_print_review_and_label(18)
pretty_print_review_and_label(24)

label		: 	 review

NEGATIVE	:	i bought this at tower records after seeing the info  mercial about fifteen hund...
NEGATIVE	:	noting the cast  i recently watched this movie on tcm  hoping for an under  appr...
POSITIVE	:	this is easily the most underrated film inn the brooks cannon . sure  its flawed...
POSITIVE	:	you know  robin williams  god bless him  is constantly shooting himself in the f...
POSITIVE	:	there are many illnesses born in the mind of man which have been given life in m...


# Developing a Predictive Theory

We will now count the occurrences of each word in both POSITIVE and NEGATIVE reviews.

In [0]:
from collections import Counter

In [0]:
positive_words = Counter()
negative_words = Counter()
total_words = Counter()

In [0]:
for i in range(len(reviews)):
  if(labels[i] == 'POSITIVE'):
    for word in reviews[i].split(" "):
      positive_words[word] += 1
      total_words[word] += 1
  else:
    for word in reviews[i].split(" "):
      negative_words[word] += 1
      total_words[word] += 1

In [10]:
positive_words.most_common()[0:30]

[('', 550468),
 ('the', 173324),
 ('.', 159654),
 ('and', 89722),
 ('a', 83688),
 ('of', 76855),
 ('to', 66746),
 ('is', 57245),
 ('in', 50215),
 ('br', 49235),
 ('it', 48025),
 ('i', 40743),
 ('that', 35630),
 ('this', 35080),
 ('s', 33815),
 ('as', 26308),
 ('with', 23247),
 ('for', 22416),
 ('was', 21917),
 ('film', 20937),
 ('but', 20822),
 ('movie', 19074),
 ('his', 17227),
 ('on', 17008),
 ('you', 16681),
 ('he', 16282),
 ('are', 14807),
 ('not', 14272),
 ('t', 13720),
 ('one', 13655)]

In [11]:
negative_words.most_common()[0:30]

[('', 561462),
 ('.', 167538),
 ('the', 163389),
 ('a', 79321),
 ('and', 74385),
 ('of', 69009),
 ('to', 68974),
 ('br', 52637),
 ('is', 50083),
 ('it', 48327),
 ('i', 46880),
 ('in', 43753),
 ('this', 40920),
 ('that', 37615),
 ('s', 31546),
 ('was', 26291),
 ('movie', 24965),
 ('for', 21927),
 ('but', 21781),
 ('with', 20878),
 ('as', 20625),
 ('t', 20361),
 ('film', 19218),
 ('you', 17549),
 ('on', 17192),
 ('not', 16354),
 ('have', 15144),
 ('are', 14623),
 ('be', 14541),
 ('he', 13856)]

We can't derive much intuition from these raw counts :( Common words such as 'the', 'a' and 'and' top both lists.

So we use positive-to-negative ratios to measure the correlation between the words in a review and its label.

In [0]:
import numpy as np

positive_negative_ratio = Counter()

for term,cnt in list(total_words.most_common()):
  if cnt>10:
    positive_negative_rat = positive_words[term]/float(negative_words[term]+1)
    positive_negative_ratio[term] = positive_negative_rat

for word,ratio in list(positive_negative_ratio.most_common()):
  if ratio > 1:
    positive_negative_ratio[word] = np.log(ratio)
  else:
    positive_negative_ratio[word] = -np.log(1/(ratio+0.01))
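A small worked example may make the log transform easier to follow. The counts below are made up for illustration and do not come from the dataset, and `log_ratio` is a hypothetical helper that simply mirrors the two steps in the cell above for a single word.

```python
import numpy as np

def log_ratio(pos_count, neg_count):
    """Mirror the two-step computation above for a single word."""
    ratio = pos_count / float(neg_count + 1)   # +1 avoids division by zero
    if ratio > 1:
        return np.log(ratio)
    # -log(1 / x) is the same as log(x); the 0.01 keeps log() away from zero
    return -np.log(1 / (ratio + 0.01))

# Hypothetical counts, chosen only to show the sign of the result
mostly_positive = log_ratio(99, 9)    # ratio 9.9  -> positive score
mostly_negative = log_ratio(9, 99)    # ratio 0.09 -> negative score
roughly_neutral = log_ratio(50, 49)   # ratio 1.0  -> score close to zero
never_positive = log_ratio(0, 50)     # ratio 0.0  -> log(0.01)

print(mostly_positive, mostly_negative, roughly_neutral, never_positive)
```

Taking the log puts positive and negative words on a roughly symmetric scale around 0. A word that never appears in a positive review has a ratio of 0, so its score is log(0.01), which is about -4.605.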

In [13]:
positive_negative_ratio.most_common()[0:30]

[('edie', 4.6913478822291435),
 ('antwone', 4.477336814478207),
 ('din', 4.406719247264253),
 ('gunga', 4.189654742026425),
 ('goldsworthy', 4.174387269895637),
 ('gypo', 4.0943445622221),
 ('yokai', 4.0943445622221),
 ('paulie', 4.07753744390572),
 ('visconti', 3.9318256327243257),
 ('flavia', 3.9318256327243257),
 ('blandings', 3.871201010907891),
 ('kells', 3.871201010907891),
 ('brashear', 3.8501476017100584),
 ('gino', 3.828641396489095),
 ('deathtrap', 3.8066624897703196),
 ('harilal', 3.713572066704308),
 ('panahi', 3.713572066704308),
 ('ossessione', 3.6635616461296463),
 ('tsui', 3.6375861597263857),
 ('caruso', 3.6375861597263857),
 ('sabu', 3.6109179126442243),
 ('ahmad', 3.6109179126442243),
 ('khouri', 3.58351893845611),
 ('dominick', 3.58351893845611),
 ('aweigh', 3.5553480614894135),
 ('mj', 3.5553480614894135),
 ('mcintire', 3.5263605246161616),
 ('kriemhild', 3.5263605246161616),
 ('blackie', 3.4965075614664802),
 ('daisies', 3.4965075614664802)]

In [14]:
list(reversed(positive_negative_ratio.most_common()))[0:30]

[('rosarios', -4.605170185988092),
 ('frewer', -4.605170185988092),
 ('manu', -4.605170185988092),
 ('borel', -4.605170185988092),
 ('swinton', -4.605170185988092),
 ('sagemiller', -4.605170185988092),
 ('summersisle', -4.605170185988092),
 ('qi', -4.605170185988092),
 ('redline', -4.605170185988092),
 ('slipstream', -4.605170185988092),
 ('bolo', -4.605170185988092),
 ('emraan', -4.605170185988092),
 ('geico', -4.605170185988092),
 ('cato', -4.605170185988092),
 ('liliom', -4.605170185988092),
 ('rajni', -4.605170185988092),
 ('mayeda', -4.605170185988092),
 ('crapfest', -4.605170185988092),
 ('tmtm', -4.605170185988092),
 ('sued', -4.605170185988092),
 ('keyes', -4.605170185988092),
 ('nichole', -4.605170185988092),
 ('straightheads', -4.605170185988092),
 ('aluminium', -4.605170185988092),
 ('groaning', -4.605170185988092),
 ('templars', -4.605170185988092),
 ('krista', -4.605170185988092),
 ('spandex', -4.605170185988092),
 ('unisols', -4.605170185988092),
 ('mache', -4.60517018598

# Creating the Input/Output Data

In [15]:
vocab = set(total_words)
vocab_size = len(vocab)
print(vocab_size)

74074


In [16]:
# Creating a row vector of size: vocab_size filled with 0

layer_0 = np.zeros((1,vocab_size))
layer_0

array([[0., 0., 0., ..., 0., 0., 0.]])

In [17]:
word2index = {}

for i,word in enumerate(vocab):
  word2index[word] = i
word2index

{'': 0,
 'microcosmos': 1,
 'bawling': 2,
 'wingin': 3,
 'rd': 4,
 'argo': 5,
 'signal': 6,
 'belivably': 7,
 'disbelief': 8,
 'terrorise': 9,
 'bams': 10,
 'ricardo': 11,
 'nightclubs': 12,
 'laments': 13,
 'rappaport': 14,
 'incestual': 15,
 'sequestered': 16,
 'ventresca': 17,
 'meals': 18,
 'johannsson': 19,
 'jacksons': 20,
 'nisep': 21,
 'reiser': 22,
 'trios': 23,
 'imagina': 24,
 'babble': 25,
 'electric': 26,
 'earn': 27,
 'unacknowledged': 28,
 'collums': 29,
 'replies': 30,
 'expediton': 31,
 'brows': 32,
 'samara': 33,
 'crashes': 34,
 'migenes': 35,
 'impulsively': 36,
 'anthrax': 37,
 'cappucino': 38,
 'iamaseal': 39,
 'joists': 40,
 'tower': 41,
 'simplifies': 42,
 'sabotaged': 43,
 'arbus': 44,
 'seafood': 45,
 'foregoes': 46,
 'flowery': 47,
 'girth': 48,
 'markham': 49,
 'kippei': 50,
 'mccartney': 51,
 'pikachu': 52,
 'noteworthy': 53,
 'za': 54,
 'figments': 55,
 'lurie': 56,
 'prided': 57,
 'duckburg': 58,
 'interstellar': 59,
 'tetsuro': 60,
 'alcoholically': 61,


In [18]:
def update_layer_0(review):
  global layer_0

  layer_0 *= 0

  for word in review.split(" "):
    layer_0[0][word2index[word]] += 1

  return layer_0[0][:]
  
update_layer_0(reviews[0])

array([18.,  0.,  0., ...,  0.,  0.,  0.])

In [0]:
def get_target_for_label(label):
  if label == 'POSITIVE':
    return 1
  else:
    return 0

In [20]:
get_target_for_label(labels[0])

1

# Building the Neural Network

In [0]:
import time
import sys
import numpy as np

# Encapsulate our neural network in a class
class SentimentNetwork:
    def __init__(self, reviews,labels,hidden_nodes = 10, learning_rate = 0.1):
        """Create a SentimentNetwork with the given settings
        Args:
            reviews(list) - List of reviews used for training
            labels(list) - List of POSITIVE/NEGATIVE labels associated with the given reviews
            hidden_nodes(int) - Number of nodes to create in the hidden layer
            learning_rate(float) - Learning rate to use while training
        
        """
        # Assign a seed to our random number generator to ensure we get
        # reproducible results during development
        np.random.seed(1)

        # process the reviews and their associated labels so that everything
        # is ready for training
        self.pre_process_data(reviews, labels)
        
        # Build the network to have the number of hidden nodes and the learning rate that
        # were passed into this initializer. Make the same number of input nodes as
        # there are vocabulary words and create a single output node.
        self.init_network(len(self.review_vocab),hidden_nodes, 1, learning_rate)

    def pre_process_data(self, reviews, labels):
        
        # populate review_vocab with all of the words in the given reviews
        review_vocab = set()
        for review in reviews:
            for word in review.split(" "):
                review_vocab.add(word)

        # Convert the vocabulary set to a list so we can access words via indices
        self.review_vocab = list(review_vocab)
        
        # populate label_vocab with all of the words in the given labels.
        label_vocab = set()
        for label in labels:
            label_vocab.add(label)
        
        # Convert the label vocabulary set to a list so we can access labels via indices
        self.label_vocab = list(label_vocab)
        
        # Store the sizes of the review and label vocabularies.
        self.review_vocab_size = len(self.review_vocab)
        self.label_vocab_size = len(self.label_vocab)
        
        # Create a dictionary of words in the vocabulary mapped to index positions
        self.word2index = {}
        for i, word in enumerate(self.review_vocab):
            self.word2index[word] = i
        
        # Create a dictionary of labels mapped to index positions
        self.label2index = {}
        for i, label in enumerate(self.label_vocab):
            self.label2index[label] = i

    def init_network(self, input_nodes, hidden_nodes, output_nodes, learning_rate):
        # Set number of nodes in input, hidden and output layers.
        self.input_nodes = input_nodes
        self.hidden_nodes = hidden_nodes
        self.output_nodes = output_nodes

        # Store the learning rate
        self.learning_rate = learning_rate

        # Initialize weights

        # These are the weights between the input layer and the hidden layer.
        self.weights_0_1 = np.zeros((self.input_nodes,self.hidden_nodes))

        # These are the weights between the hidden layer and the output layer.
        self.weights_1_2 = np.random.normal(0.0, self.output_nodes**-0.5, 
                                                (self.hidden_nodes, self.output_nodes))
        
        ## New for Project 5: Removed self.layer_0; added self.layer_1
        # The hidden layer, a two-dimensional matrix with shape 1 x hidden_nodes
        self.layer_1 = np.zeros((1,hidden_nodes))
    
    ## New for Project 5: Removed update_input_layer function
    
    def get_target_for_label(self,label):
        if(label == 'POSITIVE'):
            return 1
        else:
            return 0
        
    def sigmoid(self,x):
        return 1 / (1 + np.exp(-x))
    
    def sigmoid_output_2_derivative(self,output):
        return output * (1 - output)
    
    ## New for Project 5: changed name of first parameter from 'training_reviews'
    #                     to 'training_reviews_raw'
    def train(self, training_reviews_raw, training_labels):

        ## New for Project 5: pre-process training reviews so we can deal 
        #                     directly with the indices of non-zero inputs
        training_reviews = list()
        for review in training_reviews_raw:
            indices = set()
            for word in review.split(" "):
                if(word in self.word2index.keys()):
                    indices.add(self.word2index[word])
            training_reviews.append(list(indices))

        # make sure we have a matching number of reviews and labels
        assert(len(training_reviews) == len(training_labels))
        
        # Keep track of correct predictions to display accuracy during training 
        correct_so_far = 0

        # Remember when we started for printing time statistics
        start = time.time()
        
        # loop through all the given reviews and run a forward and backward pass,
        # updating weights for every item
        for i in range(len(training_reviews)):
            
            # Get the next review and its correct label
            review = training_reviews[i]
            label = training_labels[i]
            
            #### Implement the forward pass here ####
            ### Forward pass ###

            ## New for Project 5: Removed call to 'update_input_layer' function
            #                     because 'layer_0' is no longer used

            # Hidden layer
            ## New for Project 5: Add in only the weights for non-zero items
            self.layer_1 *= 0
            for index in review:
                self.layer_1 += self.weights_0_1[index]

            # Output layer
            ## New for Project 5: changed to use 'self.layer_1' instead of 'local layer_1'
            layer_2 = self.sigmoid(self.layer_1.dot(self.weights_1_2))            
            
            #### Implement the backward pass here ####
            ### Backward pass ###

            # Output error
            layer_2_error = layer_2 - self.get_target_for_label(label) # Output layer error is the actual output minus the desired target.
            layer_2_delta = layer_2_error * self.sigmoid_output_2_derivative(layer_2)

            # Backpropagated error
            layer_1_error = layer_2_delta.dot(self.weights_1_2.T) # errors propagated to the hidden layer
            layer_1_delta = layer_1_error # hidden layer gradients - no nonlinearity so it's the same as the error

            # Update the weights
            ## New for Project 5: changed to use 'self.layer_1' instead of local 'layer_1'
            self.weights_1_2 -= self.layer_1.T.dot(layer_2_delta) * self.learning_rate # update hidden-to-output weights with gradient descent step
            
            ## New for Project 5: Only update the weights that were used in the forward pass
            for index in review:
                self.weights_0_1[index] -= layer_1_delta[0] * self.learning_rate # update input-to-hidden weights with gradient descent step

            # Keep track of correct predictions.
            if(layer_2 >= 0.5 and label == 'POSITIVE'):
                correct_so_far += 1
            elif(layer_2 < 0.5 and label == 'NEGATIVE'):
                correct_so_far += 1
            
            # For debug purposes, print out our prediction accuracy and speed 
            # throughout the training process. 
            elapsed_time = float(time.time() - start)
            reviews_per_second = i / elapsed_time if elapsed_time > 0 else 0
            
            sys.stdout.write("\rProgress:" + str(100 * i/float(len(training_reviews)))[:4] \
                             + "% Speed(reviews/sec):" + str(reviews_per_second)[0:5] \
                             + " #Correct:" + str(correct_so_far) + " #Trained:" + str(i+1) \
                             + " Training Accuracy:" + str(correct_so_far * 100 / float(i+1))[:4] + "%")
            if(i % 2500 == 0):
                print("")
    
    def test(self, testing_reviews, testing_labels):
        """
        Attempts to predict the labels for the given testing_reviews,
        and uses the test_labels to calculate the accuracy of those predictions.
        """
        
        # keep track of how many correct predictions we make
        correct = 0

        # we'll time how many predictions per second we make
        start = time.time()

        # Loop through each of the given reviews and call run to predict
        # its label. 
        for i in range(len(testing_reviews)):
            pred = self.run(testing_reviews[i])
            if(pred == testing_labels[i]):
                correct += 1
            
            # For debug purposes, print out our prediction accuracy and speed 
            # throughout the prediction process. 

            elapsed_time = float(time.time() - start)
            reviews_per_second = i / elapsed_time if elapsed_time > 0 else 0
            
            sys.stdout.write("\rProgress:" + str(100 * i/float(len(testing_reviews)))[:4] \
                             + "% Speed(reviews/sec):" + str(reviews_per_second)[0:5] \
                             + " #Correct:" + str(correct) + " #Tested:" + str(i+1) \
                             + " Testing Accuracy:" + str(correct * 100 / float(i+1))[:4] + "%")
    
    def run(self, review):
        """
        Returns a POSITIVE or NEGATIVE prediction for the given review.
        """
        # Run a forward pass through the network, like in the "train" function.
        
        ## New for Project 5: Removed call to update_input_layer function
        #                     because layer_0 is no longer used

        # Hidden layer
        ## New for Project 5: Identify the indices used in the review and then add
        #                     just those weights to layer_1 
        self.layer_1 *= 0
        unique_indices = set()
        for word in review.lower().split(" "):
            if word in self.word2index.keys():
                unique_indices.add(self.word2index[word])
        for index in unique_indices:
            self.layer_1 += self.weights_0_1[index]
        
        # Output layer
        ## New for Project 5: changed to use self.layer_1 instead of local layer_1
        layer_2 = self.sigmoid(self.layer_1.dot(self.weights_1_2))
        
        # Return POSITIVE for values greater than or equal to 0.5 in the output layer;
        # return NEGATIVE for other values
        if(layer_2[0] >= 0.5):
            return "POSITIVE"
        else:
            return "NEGATIVE"
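The "add in only the weights for non-zero items" step relies on the fact that multiplying a 0/1 input vector by the weight matrix gives the same result as summing the rows that belong to the words present. The sketch below checks this on toy data; the sizes, the random weights and the `indices` list are assumptions made for illustration and are not taken from the trained network.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_nodes = 8, 3              # toy sizes, not the real ones
weights_0_1 = rng.normal(size=(vocab_size, hidden_nodes))

# Indices of the distinct words present in one hypothetical review
indices = [1, 4, 6]

# Dense version: build a 0/1 input vector and multiply by the full matrix
layer_0 = np.zeros((1, vocab_size))
layer_0[0, indices] = 1
dense_layer_1 = layer_0.dot(weights_0_1)

# Sparse version: add up only the rows that the review actually uses
sparse_layer_1 = np.zeros((1, hidden_nodes))
for index in indices:
    sparse_layer_1 += weights_0_1[index]

print(np.allclose(dense_layer_1, sparse_layer_1))
```

With a vocabulary of tens of thousands of words and reviews that use only a few hundred of them, the sparse version skips almost all of the multiplications by zero.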


In [28]:
mlp = SentimentNetwork(reviews[:-1000],labels[:-1000], learning_rate=0.001)
mlp.train(reviews[:-1000],labels[:-1000])

Progress:0.0% Speed(reviews/sec):0.0 #Correct:1 #Trained:1 Training Accuracy:100.%
Progress:10.4% Speed(reviews/sec):1180. #Correct:1941 #Trained:2501 Training Accuracy:77.6%
Progress:20.8% Speed(reviews/sec):1152. #Correct:3988 #Trained:5001 Training Accuracy:79.7%
Progress:31.2% Speed(reviews/sec):1161. #Correct:6086 #Trained:7501 Training Accuracy:81.1%
Progress:41.6% Speed(reviews/sec):1155. #Correct:8205 #Trained:10001 Training Accuracy:82.0%
Progress:52.0% Speed(reviews/sec):1152. #Correct:10338 #Trained:12501 Training Accuracy:82.6%
Progress:62.5% Speed(reviews/sec):1154. #Correct:12424 #Trained:15001 Training Accuracy:82.8%
Progress:72.9% Speed(reviews/sec):1150. #Correct:14525 #Trained:17501 Training Accuracy:82.9%
Progress:83.3% Speed(reviews/sec):1146. #Correct:16698 #Trained:20001 Training Accuracy:83.4%
Progress:93.7% Speed(reviews/sec):1145. #Correct:18857 #Trained:22501 Training Accuracy:83.8%
Progress:99.9% Speed(reviews/sec):1146. #Correct:20173 #Trained:24000 Training

In [29]:
mlp.test(reviews[-1000:],labels[-1000:])

Progress:99.9% Speed(reviews/sec):1676. #Correct:848 #Tested:1000 Testing Accuracy:84.8%

In [0]:
from bokeh.models import ColumnDataSource, LabelSet
from bokeh.plotting import figure, show, output_file
from bokeh.io import output_notebook
output_notebook()

In [33]:
# 'normed' is deprecated in NumPy (and removed in recent versions); 'density=True' alone is enough
hist, edges = np.histogram(list(map(lambda x:x[1],positive_negative_ratio.most_common())), density=True, bins=100)

p = figure(tools="pan,wheel_zoom,reset,save",
           toolbar_location="above",
           title="Word Positive/Negative Affinity Distribution")
p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:], line_color="#555555")
show(p)


We should get rid of the words whose positive-to-negative log ratio is close to 0, as they contribute little correlation with the review labels.
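One possible way to do this, sketched here on toy data, is to keep a word only when the absolute value of its score exceeds a threshold. The name `polarity_cutoff`, its value and the scores in `toy_ratios` are all assumptions for illustration; a real cutoff would need tuning against the test accuracy.

```python
from collections import Counter

# Toy log-ratio scores standing in for positive_negative_ratio
toy_ratios = Counter({"excellent": 1.5, "the": 0.05, "movie": -0.2, "awful": -2.1})

polarity_cutoff = 0.3   # hypothetical threshold, would need tuning

# Keep a word only if its score is far enough from 0 in either direction
filtered_vocab = {
    word for word, score in toy_ratios.items()
    if abs(score) >= polarity_cutoff
}

print(sorted(filtered_vocab))
```

The resulting set could then replace the full vocabulary when building `word2index`, so that neutral words never reach the network.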

In [0]:
frequency_frequency = Counter()

for word, cnt in total_words.most_common():
    frequency_frequency[cnt] += 1

In [36]:
# 'normed' is deprecated in NumPy (and removed in recent versions); 'density=True' alone is enough
hist, edges = np.histogram(list(map(lambda x:x[1],frequency_frequency.most_common())), density=True, bins=100)

p = figure(tools="pan,wheel_zoom,reset,save",
           toolbar_location="above",
           title="The frequency distribution of the words in our corpus")
p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:], line_color="#555555")
show(p)


This graph shows that many words have a large frequency, so it's better to get rid of them as well.
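A minimal sketch of such a frequency filter follows, again on toy data. The reviews, the name `max_count` and its value are assumptions for illustration only; on the real corpus the bound would be applied to `total_words` and chosen by experiment.

```python
from collections import Counter

# Toy corpus standing in for the real reviews
toy_reviews = [
    "the movie was great",
    "the movie was terrible",
    "the acting was great",
]

# Count every word, the same way total_words is built
toy_counts = Counter()
for review in toy_reviews:
    for word in review.split(" "):
        toy_counts[word] += 1

max_count = 2   # hypothetical upper bound on how often a word may occur

# Words above the bound are dropped; everything else stays in the vocabulary
frequent_words = {word for word, cnt in toy_counts.items() if cnt > max_count}
kept_vocab = set(toy_counts) - frequent_words

print(sorted(frequent_words), sorted(kept_vocab))
```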