# Sentiment Analysis
 - Based on Andrew Trask's teaching on Udacity.
 - Optimizing a sentiment analysis neural network:
   - Understand neural noise and remove it. For example, if 'the' appears 18 times in a review, a count-based input weights it far more heavily than an informative word that appears only once, even though 'the' contributes nothing to what the network needs to learn.
   - Understand inefficiencies and remove them. Most of the input layer is zeros, so most of the multiplications in the input-to-hidden step are wasted and should be skipped. Multiplying a weight by 1 is also unnecessary, since the result is just the weight.
   - Keep reducing noise by dropping words that are too frequent or too rare.
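
To make the noise point concrete, here is a small sketch comparing a count-based input with a binary presence input. The review text and the variable names are invented for illustration. With counts, a filler word such as 'the' gets the largest input value; with presence flags, every word that occurs contributes equally.

```python
from collections import Counter

# Hypothetical review, invented for this illustration
review = "the movie was great and the cast and the script were the best"

# Count-based input: each word's input value is how often it occurs
count_input = Counter(review.split(" "))

# Binary input: each word that occurs gets the value 1, regardless of frequency
binary_input = {word: 1 for word in count_input}

print(count_input.most_common(3))
print(binary_input["the"], binary_input["great"])
```

In the count-based version 'the' outweighs 'great' four to one, although only 'great' says anything about sentiment.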

## Build intuition before starting

In [5]:
from collections import Counter
import numpy as np

In [6]:
positive_counts = Counter()
negative_counts = Counter()
reviews_positive = ['Great','Excellent','Happy','cool','Nice','Great','Great','Great','Happy','Happy','Happy','Happy',]
reviews_negative = ['sucks','not-fun','unsatified','bad','terrible','bad','terrible','bad','terrible']
for p in reviews_positive:
    positive_counts[p] += 1

for p in reviews_negative:
    negative_counts[p] += 1

In [7]:
positive_counts.most_common()

[('Happy', 5), ('Great', 4), ('Nice', 1), ('Excellent', 1), ('cool', 1)]

In [8]:
negative_counts.most_common()

[('terrible', 3), ('bad', 3), ('sucks', 1), ('unsatified', 1), ('not-fun', 1)]

In [9]:
for term, cnt in list(positive_counts.most_common()):
    print(term,'\t\t',cnt,'\t',positive_counts[term])

Happy 		 5 	 5
Great 		 4 	 4
Nice 		 1 	 1
Excellent 		 1 	 1
cool 		 1 	 1


In [10]:
positive_counts['bad']  # a missing key returns 0 instead of raising KeyError

0

In [11]:
# Counter supports +, which sums the counts key by key
total_counts = positive_counts + negative_counts
total_counts

Counter({'Excellent': 1,
         'Great': 4,
         'Happy': 5,
         'Nice': 1,
         'bad': 3,
         'cool': 1,
         'not-fun': 1,
         'sucks': 1,
         'terrible': 3,
         'unsatified': 1})

In [12]:
pos_neg_ratios = Counter()
for term, cnt in list(total_counts.most_common()):
    # the +1 in the denominator avoids division by zero for words with no negative occurrences
    pos_neg_ratio = positive_counts[term] / float(negative_counts[term]+1)
    pos_neg_ratios[term] = pos_neg_ratio
pos_neg_ratios.most_common()

[('Happy', 5.0),
 ('Great', 4.0),
 ('Nice', 1.0),
 ('cool', 1.0),
 ('Excellent', 1.0),
 ('sucks', 0.0),
 ('unsatified', 0.0),
 ('terrible', 0.0),
 ('bad', 0.0),
 ('not-fun', 0.0)]

In [13]:
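pos_neg_ratios2 = Counter()
for word, ratio in pos_neg_ratios.most_common():
    if (ratio>=1):
        # positive-leaning words map to values >= 0
        pos_neg_ratios2[word] = np.log(ratio)
    else:
        # negative-leaning words; adding 0.01 avoids taking the log of zero
        pos_neg_ratios2[word] = -np.log(1 /(ratio + 0.01))

pos_neg_ratios2.most_common()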

[('Happy', 1.6094379124341003),
 ('Great', 1.3862943611198906),
 ('Nice', 0.0),
 ('cool', 0.0),
 ('Excellent', 0.0),
 ('sucks', -4.6051701859880918),
 ('not-fun', -4.6051701859880918),
 ('terrible', -4.6051701859880918),
 ('bad', -4.6051701859880918),
 ('unsatified', -4.6051701859880918)]
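
The two steps above (ratio, then log transform) can be folded into one helper. This is only a sketch: `log_polarity` is a hypothetical name that does not appear in the notebook, and it uses `math.log` in place of `np.log` so that it runs without NumPy. Note that `-log(1 / (ratio + 0.01))` is algebraically the same as `log(ratio + 0.01)`.

```python
import math

def log_polarity(positive_count, negative_count):
    """Log-scaled positive-to-negative ratio, mirroring the cells above."""
    # +1 in the denominator avoids division by zero for words
    # that never occur in negative reviews
    ratio = positive_count / float(negative_count + 1)
    if ratio >= 1:
        return math.log(ratio)
    # same as log(ratio + 0.01); the 0.01 keeps the argument above zero
    return -math.log(1 / (ratio + 0.01))

print(log_polarity(5, 0))   # counts of 'Happy' in the toy data
print(log_polarity(0, 3))   # counts of 'bad' in the toy data
```

After the transform, positive-leaning words get positive scores, negative-leaning words get negative scores, and words with a ratio of exactly 1 score 0.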

## Curate a Dataset

In [1]:
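from azureml import Workspace

# Load the reviews and labels from an Azure ML workspace
ws = Workspace()
ds1 = ws.datasets['sentiment_analysis_reviews.txt']
ds2 = ws.datasets['sentiment_analysis_labels.txt']
frame1 = ds1.to_dataframe()
frame2 = ds2.to_dataframe()
reviews = frame1[0].tolist()
# upper-case the labels so they match the 'POSITIVE'/'NEGATIVE' checks in SentimentNetwork
labels = [str(label).upper() for label in frame2[0].tolist()]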

In [1]:
with open('reviews.txt','r') as g: # What we know!
    reviews = [line.rstrip('\n') for line in g]

with open('labels.txt','r') as g: # What we WANT to know!
    labels = [line.rstrip('\n').upper() for line in g]

In [2]:
def pretty_print_review_and_label(i):
    print(labels[i] + "\t:\t" + reviews[i][:80] + "...")

In [3]:
print(len(reviews))
print(reviews[0])
print(labels[0])

25000
bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   
positive


In [15]:
print("labels.txt \t : \t reviews.txt\n")
pretty_print_review_and_label(2137)
pretty_print_review_and_label(12816)
pretty_print_review_and_label(6267)
pretty_print_review_and_label(21934)
pretty_print_review_and_label(5297)
pretty_print_review_and_label(4998)

labels.txt 	 : 	 reviews.txt

negative	:	this movie is terrible but it has some good effects .  ...
positive	:	adrian pasdar is excellent is this film . he makes a fascinating woman .  ...
negative	:	comment this movie is impossible . is terrible  very improbable  bad interpretat...
positive	:	excellent episode movie ala pulp fiction .  days   suicides . it doesnt get more...
negative	:	if you haven  t seen this  it  s terrible . it is pure trash . i saw this about ...
positive	:	this schiffer guy is a real genius  the movie is of excellent quality and both e...


In [25]:
from collections import Counter
import time
import sys
import numpy as np

In [49]:
# Let's tweak our network from before to model these phenomena
class SentimentNetwork:
    def __init__(self, reviews,labels,min_count = 10,polarity_cutoff = 0.1,hidden_nodes = 10, learning_rate = 0.1):
       
        np.random.seed(1)
    
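        # pass labels explicitly so pre_process_data does not depend on a global variable
        self.pre_process_data(reviews, labels, polarity_cutoff, min_count)
        
        self.init_network(len(self.review_vocab),hidden_nodes, 1, learning_rate)
        
        
    def pre_process_data(self, reviews, labels, polarity_cutoff, min_count):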
        
        positive_counts = Counter()
        negative_counts = Counter()
        total_counts = Counter()
        
        ## 1. Count words in POSITIVE and NEGATIVE reviews separately
        for i in range(len(reviews)):
            if(labels[i] == 'POSITIVE'):
                for word in reviews[i].split(" "):
                    positive_counts[word] += 1
                    total_counts[word] += 1
            else:
                for word in reviews[i].split(" "):
                    negative_counts[word] += 1
                    total_counts[word] += 1
                    
        ## 2. Compute a positive-to-negative ratio for each sufficiently common word
        pos_neg_ratios = Counter()

        for term,cnt in list(total_counts.most_common()):
            if(cnt >= 50): # frequency threshold; adjust after inspecting the counts
                pos_neg_ratio = positive_counts[term] / float(negative_counts[term]+1)
                pos_neg_ratios[term] = pos_neg_ratio
                
        ## 3. Log-transform the ratios: positive-leaning words become positive numbers, negative-leaning words become negative numbers
        for word,ratio in pos_neg_ratios.most_common():
            if(ratio > 1):
                pos_neg_ratios[word] = np.log(ratio)
            else:
                pos_neg_ratios[word] = -np.log((1 / (ratio + 0.01)))
        
        ## 4. Collect the words for the bag of words
        ### keep a word only if its count is greater than min_count and, when it has a ratio,
        ### the magnitude of that ratio is at least polarity_cutoff
        ### using set() avoids duplicates
        review_vocab = set()
        for review in reviews:
            for word in review.split(" "):
                if(total_counts[word] > min_count):
                    if(word in pos_neg_ratios):
                        if((pos_neg_ratios[word] >= polarity_cutoff) or (pos_neg_ratios[word] <= -polarity_cutoff)):
                            review_vocab.add(word)
                    else:
                        review_vocab.add(word)
        self.review_vocab = list(review_vocab)
        
        ## 5. Collect the distinct labels
        ### This is especially useful when there are many label categories.
        label_vocab = set()
        for label in labels:
            label_vocab.add(label)        
        self.label_vocab = list(label_vocab)
        
        self.review_vocab_size = len(self.review_vocab)
        self.label_vocab_size = len(self.label_vocab)
        
        ## 6. Map each word to an index, so that words can be assigned to the NN's input nodes
        self.word2index = {}
        for i, word in enumerate(self.review_vocab):
            self.word2index[word] = i
        
        ## 7. Map each label to an index in the same way
        self.label2index = {}
        for i, label in enumerate(self.label_vocab):
            self.label2index[label] = i
         
        
    def init_network(self, input_nodes, hidden_nodes, output_nodes, learning_rate):
        # Set number of nodes in input, hidden and output layers.
        self.input_nodes = input_nodes
        self.hidden_nodes = hidden_nodes
        self.output_nodes = output_nodes

        # Initialize weights: input-to-hidden weights start at zero,
        # hidden-to-output weights are drawn from a normal distribution
        self.weights_0_1 = np.zeros((self.input_nodes,self.hidden_nodes))
        
        self.weights_1_2 = np.random.normal(0.0, self.output_nodes**-0.5, 
                                                (self.hidden_nodes, self.output_nodes))
        
        self.learning_rate = learning_rate
        
        ### Pre-allocate the layer vectors once, so memory is not reallocated for every review.
        self.layer_0 = np.zeros((1,input_nodes))
        self.layer_1 = np.zeros((1,hidden_nodes))
        
    def sigmoid(self,x):
        return 1 / (1 + np.exp(-x))
    
    
    def sigmoid_output_2_derivative(self,output):
        return output * (1 - output)
    
    def update_input_layer(self,review):

        # clear out previous state, reset the layer to be all 0s
        self.layer_0 *= 0
        for word in review.split(" "):
            # skip words that were filtered out of the vocabulary
            if word in self.word2index:
                self.layer_0[0][self.word2index[word]] = 1

    def get_target_for_label(self,label):
        if(label == 'POSITIVE'):
            return 1
        else:
            return 0
        
    def train(self, training_reviews_raw, training_labels):
        
        ## Efficiency improvement: for each review, keep only the indices of the words it contains,
        ## so the all-zero inputs can be skipped
        training_reviews = list()
        for review in training_reviews_raw:
            indices = set()
            for word in review.split(" "):
                if(word in self.word2index):
                    indices.add(self.word2index[word])
            training_reviews.append(list(indices))
        
        assert(len(training_reviews) == len(training_labels))
        
        correct_so_far = 0  # running count of correct predictions, used to report training accuracy
        
        start = time.time()
        
        for i in range(len(training_reviews)):
            
            review = training_reviews[i]
            label = training_labels[i]
            
            ### Forward pass ###

            # Input layer: update_input_layer() is skipped for efficiency,
            # because `review` already holds the indices of the words present.

            # Hidden layer: summing the weight rows of the words present is
            # equivalent to layer_0.dot(weights_0_1) with a 0/1 input vector.
            self.layer_1 *= 0
            for index in review:
                self.layer_1 += self.weights_0_1[index]
            
            # Output layer
            layer_2 = self.sigmoid(self.layer_1.dot(self.weights_1_2))

            ### Backward pass ###

            # Output error: difference between the actual output and the desired target
            layer_2_error = layer_2 - self.get_target_for_label(label)
            layer_2_delta = layer_2_error * self.sigmoid_output_2_derivative(layer_2)

            # Backpropagated error
            layer_1_error = layer_2_delta.dot(self.weights_1_2.T) # errors propagated to the hidden layer
            layer_1_delta = layer_1_error # hidden layer gradients - no nonlinearity so it's the same as the error

            # Update the weights
            self.weights_1_2 -= self.layer_1.T.dot(layer_2_delta) * self.learning_rate # update hidden-to-output weights with gradient descent step
            
            ## Efficiency improvement: update only the weight rows of the words present in this review
            for index in review:
                self.weights_0_1[index] -= layer_1_delta[0] * self.learning_rate # update input-to-hidden weights with gradient descent step

            if(layer_2 >= 0.5 and label == 'POSITIVE'):
                correct_so_far += 1
            if(layer_2 < 0.5 and label == 'NEGATIVE'):
                correct_so_far += 1
            
            reviews_per_second = i / float(time.time() - start)
            
            sys.stdout.write("\rProgress:" + str(100 * i/float(len(training_reviews)))[:4] + 
                             "% Speed(reviews/sec):" + str(reviews_per_second)[0:5] + 
                             " #Correct:" + str(correct_so_far) + 
                             " #Trained:" + str(i+1) + 
                             " Training Accuracy:" + str(correct_so_far * 100 / float(i+1))[:4] + "%")
            
            ## every 2500 reviews, start a new line
            #if(i % 2500 == 0):
            #    print("")
        
    
    def test(self, testing_reviews, testing_labels):
        
        correct = 0
        
        start = time.time()
        
        for i in range(len(testing_reviews)):
            pred = self.run(testing_reviews[i])
            if(pred == testing_labels[i]):
                correct += 1
            
            reviews_per_second = i / float(time.time() - start)
            
            sys.stdout.write("\rProgress:" + str(100 * i/float(len(testing_reviews)))[:4] +
                             "% Speed(reviews/sec):" + str(reviews_per_second)[0:5] +
                             "% #Correct:" + str(correct) + " #Tested:" + str(i+1) + 
                             " Testing Accuracy:" + str(correct * 100 / float(i+1))[:4] + "%")
    
    def run(self, review):
        
        # Input layer: update_input_layer() is skipped for efficiency

        # Hidden layer: sum the weight rows of the known words in the review
        self.layer_1 *= 0
        unique_indices = set()
        for word in review.lower().split(" "):
            if word in self.word2index:
                unique_indices.add(self.word2index[word])
        for index in unique_indices:
            self.layer_1 += self.weights_0_1[index]
        
        # Output layer
        layer_2 = self.sigmoid(self.layer_1.dot(self.weights_1_2))
        
        if(layer_2[0] >= 0.5):
            return "POSITIVE"
        else:
            return "NEGATIVE"
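
The efficiency trick in `train` and `run` rests on one fact: multiplying a 0/1 input vector by the weight matrix gives the same result as adding up the weight rows of the words that are present. The sketch below checks this on a tiny made-up example. The sizes, the random weights and the index list are all assumptions chosen for illustration, not values from the notebook.

```python
import numpy as np

rng = np.random.RandomState(1)

vocab_size, hidden_nodes = 8, 3
weights_0_1 = rng.normal(0.0, 1.0, (vocab_size, hidden_nodes))  # made-up weights

# Indices of the words present in one hypothetical review
indices = [1, 4, 6]

# Dense version: build the 0/1 input vector and multiply by the full weight matrix
layer_0 = np.zeros((1, vocab_size))
layer_0[0, indices] = 1
dense_layer_1 = layer_0.dot(weights_0_1)

# Sparse version: add up only the weight rows of the words that are present
sparse_layer_1 = np.zeros((1, hidden_nodes))
for index in indices:
    sparse_layer_1 += weights_0_1[index]

print(np.allclose(dense_layer_1, sparse_layer_1))
```

The dense version touches every row of the weight matrix, while the sparse version touches only as many rows as the review has distinct known words.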
        

In [50]:
mlp = SentimentNetwork(reviews[:-1000],labels[:-1000],min_count=20,polarity_cutoff=0.05,learning_rate=0.1)
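
The `min_count` and `polarity_cutoff` arguments control which words enter the vocabulary. Below is a standalone sketch of that filtering rule. `keep_word` is a hypothetical helper meant to mirror the condition in `pre_process_data`, and the counts and polarity scores are invented for illustration.

```python
def keep_word(word, total_counts, polarities, min_count, polarity_cutoff):
    """Return True if `word` would be added to the vocabulary."""
    # Rare words are dropped outright
    if total_counts.get(word, 0) <= min_count:
        return False
    # Words without a polarity score (too few occurrences to get a ratio) are kept
    if word not in polarities:
        return True
    # Words with a score are kept only if they lean clearly positive or negative
    return abs(polarities[word]) >= polarity_cutoff

# Invented numbers for illustration only
total_counts = {"the": 300000, "excellent": 4000, "zzz": 3, "plot": 40}
polarities = {"the": 0.02, "excellent": 1.4}

vocab = [w for w in total_counts
         if keep_word(w, total_counts, polarities, min_count=20, polarity_cutoff=0.05)]
print(vocab)
```

Raising `polarity_cutoff` shrinks the vocabulary further, which should speed up training at the risk of discarding words that carry some signal.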

In [51]:
mlp.weights_0_1

array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ..., 
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]])
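
Because `weights_0_1` starts as all zeros, the hidden layer is zero for every review before training. The output is then sigmoid(0) = 0.5 whatever `weights_1_2` holds, so `run` returns "POSITIVE" for every input until the first weight update. A small sketch of that arithmetic, with layer sizes and word indices made up for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

hidden_nodes, output_nodes = 10, 1   # sizes chosen for illustration
np.random.seed(1)

weights_0_1 = np.zeros((5, hidden_nodes))  # untrained input-to-hidden weights
weights_1_2 = np.random.normal(0.0, output_nodes ** -0.5,
                               (hidden_nodes, output_nodes))

# Sum the weight rows for a hypothetical review containing words 0, 2 and 3
layer_1 = np.zeros((1, hidden_nodes))
for index in [0, 2, 3]:
    layer_1 += weights_0_1[index]

layer_2 = sigmoid(layer_1.dot(weights_1_2))
print(layer_2)
```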

In [52]:
# check what the predictions look like before training
mlp.test(reviews[-1000:],labels[-1000:])

Progress:99.9% Speed(reviews/sec):2164.% #Correct:0 #Tested:1000 Testing Accuracy:0.0%

In [48]:
len(reviews)

25000

In [53]:
# training
mlp.train(reviews[:-1000],labels[:-1000])

Progress:99.9% Speed(reviews/sec):1748. #Correct:0 #Trained:24000 Training Accuracy:0.0%
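
An accuracy of exactly 0.0% in both testing and training points at the labels rather than at the network. The class compares against the strings 'POSITIVE' and 'NEGATIVE', while the labels printed earlier in this notebook appear in lowercase. Assuming that mismatch is the cause, a quick check like the sketch below would surface it before training. Both helper names are hypothetical.

```python
EXPECTED_LABELS = {"POSITIVE", "NEGATIVE"}  # the strings SentimentNetwork compares against

def unexpected_labels(labels):
    """Return the set of label values the network would not recognise."""
    return set(labels) - EXPECTED_LABELS

def normalize_labels(labels):
    """Strip whitespace and upper-case each label."""
    return [label.strip().upper() for label in labels]

sample = ["positive", "negative", "positive"]  # lowercase, as printed earlier
print(unexpected_labels(sample))
print(unexpected_labels(normalize_labels(sample)))
```

If the first call returns a non-empty set, normalizing the labels before constructing and training the network should let the accuracy counters work as intended.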