# Sentiment Classification

## Load and explore the Dataset

In [112]:
g = open('reviews.txt','r')
reviews = list(map(lambda x:x[:-1],g.readlines()))
g.close()

g = open('labels.txt','r')
labels = list(map(lambda x:x[:-1].upper(),g.readlines()))
g.close()

In [113]:
len(reviews)

25000

In [114]:
reviews[0]

'bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   '

In [115]:
labels[0]

'POSITIVE'

## Develop a Predictive Theory

In [116]:
def pretty_print_review_and_label(i):
    print(labels[i] + "\t:\t" + reviews[i][:80] + "...") 

print("labels.txt \t : \t reviews.txt\n")
pretty_print_review_and_label(1)
pretty_print_review_and_label(2)
pretty_print_review_and_label(3)
pretty_print_review_and_label(4)
pretty_print_review_and_label(5)
pretty_print_review_and_label(6)

labels.txt 	 : 	 reviews.txt

NEGATIVE	:	story of a man who has unnatural feelings for a pig . starts out with a opening ...
POSITIVE	:	homelessness  or houselessness as george carlin stated  has been an issue for ye...
NEGATIVE	:	airport    starts as a brand new luxury    plane is loaded up with valuable pain...
POSITIVE	:	brilliant over  acting by lesley ann warren . best dramatic hobo lady i have eve...
NEGATIVE	:	this film lacked something i couldn  t put my finger on at first charisma on the...
POSITIVE	:	this is easily the most underrated film inn the brooks cannon . sure  its flawed...


## Quick Theory Validation

In [117]:
from collections import Counter
import numpy as np

We'll create three `Counter` objects, one for words from postive reviews, one for words from negative reviews, and one for all the words.

In [118]:
# Create three Counter objects to store positive, negative and total counts
positive_counts = Counter()
negative_counts = Counter()
total_counts = Counter()

In [119]:
# Loop over all the words in all the reviews and increment the counts in the appropriate counter objects
for i in range(len(reviews)):
    if (labels[i] == 'POSITIVE'):
        for word in reviews[i].split(" "):
            positive_counts[word] += 1 
            total_counts[word] += 1
    
    elif (labels[i] == 'NEGATIVE'):        
        for word in reviews[i].split(" "):
            negative_counts[word] += 1 
            total_counts[word] += 1

Run the following two cells to list the words used in positive reviews and negative reviews, respectively, ordered from most to least commonly used. 

In [120]:
# Examine the counts of the most common words in positive reviews
positive_counts.most_common(10)

[('', 550468),
 ('the', 173324),
 ('.', 159654),
 ('and', 89722),
 ('a', 83688),
 ('of', 76855),
 ('to', 66746),
 ('is', 57245),
 ('in', 50215),
 ('br', 49235)]

In [121]:
# Examine the counts of the most common words in negative reviews
negative_counts.most_common(10)

[('', 561462),
 ('.', 167538),
 ('the', 163389),
 ('a', 79321),
 ('and', 74385),
 ('of', 69009),
 ('to', 68974),
 ('br', 52637),
 ('is', 50083),
 ('it', 48327)]

In [122]:
# Create Counter object to store positive/negative ratios
pos_neg_ratios = Counter()

# Calculate the ratios of positive and negative uses of the most common words
# Note: consider words to be "common" if they've been used at least 100 times

for word in total_counts:
    if (total_counts[word] > 100):
        ratio = positive_counts[word] / float(negative_counts[word]+1)
        pos_neg_ratios[word] = ratio
    else:
        pass
    

Examine the ratios you've calculated for a few words:

In [123]:
print("Pos-to-neg ratio for 'the' = {}".format(pos_neg_ratios["the"]))
print("Pos-to-neg ratio for 'amazing' = {}".format(pos_neg_ratios["amazing"]))
print("Pos-to-neg ratio for 'terrible' = {}".format(pos_neg_ratios["terrible"]))

Pos-to-neg ratio for 'the' = 1.0607993145235326
Pos-to-neg ratio for 'amazing' = 4.022813688212928
Pos-to-neg ratio for 'terrible' = 0.17744252873563218


Looking closely at the values you just calculated, we see the following:

* Words that you would expect to see more often in positive reviews – like "amazing" – have a ratio greater than 1. The more skewed a word is toward postive, the farther from 1 its positive-to-negative ratio  will be.
* Words that you would expect to see more often in negative reviews – like "terrible" – have positive values that are less than 1. The more skewed a word is toward negative, the closer to zero its positive-to-negative ratio will be.
* Neutral words, which don't really convey any sentiment because you would expect to see them in all sorts of reviews – like "the" – have values very close to 1. A perfectly neutral word – one that was used in exactly the same number of positive reviews as negative reviews – would be almost exactly 1. The `+1` we suggested you add to the denominator slightly biases words toward negative, but it won't matter because it will be a tiny bias and later we'll be ignoring words that are too close to neutral anyway.

Ok, the ratios tell us which words are used more often in postive or negative reviews, but the specific values we've calculated are a bit difficult to work with. A very positive word like "amazing" has a value above 4, whereas a very negative word like "terrible" has a value around 0.18. Those values aren't easy to compare for a couple of reasons:

* Right now, 1 is considered neutral, but the absolute value of the postive-to-negative rations of very postive words is larger than the absolute value of the ratios for the very negative words. So there is no way to directly compare two numbers and see if one word conveys the same magnitude of positive sentiment as another word conveys negative sentiment. So we should center all the values around neutral so the absolute value from neutral of the postive-to-negative ratio for a word would indicate how much sentiment (positive or negative) that word conveys.
* When comparing absolute values it's easier to do that around zero than one. 

To fix these issues, we'll convert all of our ratios to new values using logarithms. In the end, extremely positive and extremely negative words will have positive-to-negative ratios with similar magnitudes but opposite signs.

In [124]:
pos_neg_ratios.most_common(10)

[('edie', 109.0),
 ('paulie', 59.0),
 ('felix', 23.4),
 ('polanski', 16.833333333333332),
 ('matthau', 16.555555555555557),
 ('victoria', 14.6),
 ('mildred', 13.5),
 ('gandhi', 12.666666666666666),
 ('flawless', 11.6),
 ('superbly', 9.583333333333334)]

In [125]:
# Convert ratios to logs

for word,ratio in pos_neg_ratios.most_common():
    pos_neg_ratios[word] = np.log(ratio)

Examine the new ratios you've calculated for the same words from before:

In [126]:
print("Pos-to-neg ratio for 'the' = {}".format(pos_neg_ratios["the"]))
print("Pos-to-neg ratio for 'amazing' = {}".format(pos_neg_ratios["amazing"]))
print("Pos-to-neg ratio for 'terrible' = {}".format(pos_neg_ratios["terrible"]))

Pos-to-neg ratio for 'the' = 0.05902269426102881
Pos-to-neg ratio for 'amazing' = 1.3919815802404802
Pos-to-neg ratio for 'terrible' = -1.7291085042663878


If everything worked, now you should see neutral words with values close to zero. In this case, "the" is near zero but slightly positive, so it was probably used in more positive reviews than negative reviews. But look at "amazing"'s ratio - it's above `1`, showing it is clearly a word with positive sentiment. And "terrible" has a similar score, but in the opposite direction, so it's below `-1`. It's now clear that both of these words are associated with specific, opposing sentiments.

Now run the following cells to see more ratios. 

The first cell displays all the words, ordered by how associated they are with postive reviews. (Your notebook will most likely truncate the output so you won't actually see *all* the words in the list.)

The second cell displays the 30 words most associated with negative reviews by reversing the order of the first list and then looking at the first 30 words. (If you want the second cell to display all the words, ordered by how associated they are with negative reviews, you could just write `reversed(pos_neg_ratios.most_common())`.)

You should continue to see values similar to the earlier ones we checked – neutral words will be close to `0`, words will get more positive as their ratios approach and go above `1`, and words will get more negative as their ratios approach and go below `-1`. That's why we decided to use the logs instead of the raw ratios.

## Transforming Text into Numbers

In [127]:
# Create set named "vocab" containing all of the words from all of the reviews
vocab = set(total_counts.keys())

Run the following cell to check your vocabulary size. If everything worked correctly, it should print **74074**

In [128]:
vocab_size = len(vocab)
print(vocab_size)

74074


In [129]:
# Create layer_0 matrix with dimensions 1 by vocab_size, initially filled with zeros
layer_0 = np.zeros((1,vocab_size))

In [130]:
layer_0.shape

(1, 74074)

In [131]:
# Create a dictionary of words in the vocabulary mapped to index positions
word2index = {}
for i,word in enumerate(vocab):
    word2index[word] = i
    
# display the map of words to indices
# word2index

In [132]:
# Complete the implementation of `update_input_layer`
# It should count how many times each word is used in the given review, 
# and then store those counts at the appropriate indices inside `layer_0`.

def update_input_layer(review):
    """ Modify the global layer_0 to represent the vector form of review.
    The element at a given index of layer_0 should represent
    how many times the given word occurs in the review.
    Args:
        review(string) - the string of the review
    Returns:
        None
    """
    global layer_0
    # clear out previous state by resetting the layer to be all 0s
    layer_0 *= 0

    # count how many times each word is used in the given review and store the results in layer_0 
    for word in review.split(" "):
        layer_0[0][word2index[word]] += 1 

In [133]:
# Example
update_input_layer(reviews[0])
layer_0

array([[18.,  0.,  0., ...,  0.,  0.,  0.]])

In [134]:
# Complete the implementation of `get_target_for_labels`. It should return `0` or `1`, 
# depending on whether the given label is `NEGATIVE` or `POSITIVE`, respectively.

def get_target_for_label(label):
    """Convert a label to `0` or `1`.
    Args:
        label(string) - Either "POSITIVE" or "NEGATIVE".
    Returns:
        `0` or `1`.
    """
    # TODO: Your code here
    if(label == "POSITIVE"):
        return 1
    else:
        return 0

In [135]:
# Example
labels[0]
get_target_for_label(labels[0])

1

## Building a Neural Network

To build the neural network, we need to execute the following steps:
- Create a basic neural network with an input layer, a hidden layer, and an output layer. 
- Do **not** add a non-linearity in the hidden layer. That is, do not use an activation function when calculating the hidden layer outputs.
- Create the training data
- Implement the `pre_process_data` function to create the vocabulary for our training data generating functions
- Ensure `train` trains over the entire corpus

In [136]:
import time
import sys
import numpy as np

# Encapsulate our neural network in a class
class SentimentNetwork:
    def __init__(self, reviews, labels, hidden_nodes = 10, learning_rate = 0.1):
        """Create a SentimenNetwork with the given settings
        Args:
            reviews(list) - List of reviews used for training
            labels(list) - List of POSITIVE/NEGATIVE labels associated with the given reviews
            hidden_nodes(int) - Number of nodes to create in the hidden layer
            learning_rate(float) - Learning rate to use while training
        
        """
        # Assign a seed to our random number generator to ensure we get
        # reproducable results during development 
        np.random.seed(1)

        # process the reviews and their associated labels so that everything
        # is ready for training
        self.pre_process_data(reviews, labels)
        
        # Build the network to have the number of hidden nodes and the learning rate that
        # were passed into this initializer. Make the same number of input nodes as
        # there are vocabulary words and create a single output node.
        self.init_network(len(self.review_vocab),hidden_nodes, 1, learning_rate)

        
    def pre_process_data(self, reviews, labels):
        
        # populate review_vocab with all of the words in the given reviews
        review_vocab = set()
        for review in reviews:
            for word in review.split(" "):
                review_vocab.add(word)

        # Convert the vocabulary set to a list so we can access words via indices
        self.review_vocab = list(review_vocab)
        
        # populate label_vocab with all of the words in the given labels.
        label_vocab = set()
        for label in labels:
            label_vocab.add(label)
        
        # Convert the label vocabulary set to a list so we can access labels via indices
        self.label_vocab = list(label_vocab)
        
        # Store the sizes of the review and label vocabularies.
        self.review_vocab_size = len(self.review_vocab)
        self.label_vocab_size = len(self.label_vocab)
        
        # Create a dictionary of words in the vocabulary mapped to index positions
        self.word2index = {}
        for i,word in enumerate(self.review_vocab):
            self.word2index[word] = i
        
        # Create a dictionary of labels mapped to index positions
        self.label2index = {}
        for i,label in enumerate(self.label_vocab):
            self.label2index[label] = i
        
        
    def init_network(self, input_nodes, hidden_nodes, output_nodes, learning_rate):
        # Store the number of nodes in input, hidden, and output layers.
        self.input_nodes = input_nodes
        self.hidden_nodes = hidden_nodes
        self.output_nodes = output_nodes

        # Store the learning rate
        self.learning_rate = learning_rate

        # Initialize weights
        self.weights_0_1 = np.zeros((self.input_nodes,self.hidden_nodes))
        self.weights_1_2 = np.random.normal(0.0,self.hidden_nodes**-0.5,(self.hidden_nodes,self.output_nodes))
        
        # Create layer one
        self.layer_1 = np.zeros((1,hidden_nodes))            
                
    def get_target_for_label(self,label):

        """Convert a label to `0` or `1`.
        Args:
            label(string) - Either "POSITIVE" or "NEGATIVE".
        Returns:
            `0` or `1`.
        """
        
        if(label == "POSITIVE"):
            return 1
        else:
            return 0
        
        
    def sigmoid(self,x):
        return 1 / (1 + np.exp(-x))
    
    
    def sigmoid_output_2_derivative(self,output):
        return output * (1 - output)

    
    def train(self, training_reviews_raw, training_labels):
        training_reviews = list()
        for review in training_reviews_raw:
            indices = set()
            for word in review.split(" "):
                if(word in self.word2index.keys()):
                    indices.add(self.word2index[word])
            training_reviews.append(list(indices))

        # make sure out we have a matching number of reviews and labels
        assert(len(training_reviews_raw) == len(training_labels))
        
        # Keep track of correct predictions to display accuracy during training 
        correct_so_far = 0
        
        # Remember when we started for printing time statistics
        start = time.time()

        # loop through all the given reviews and run a forward and backward pass,
        # updating weights for every item
        for i in range(len(training_reviews)):
            
            review = training_reviews[i]
            label = training_labels[i]

            # Hidden layer
            self.layer_1 *= 0
            for index in review:
                self.layer_1 += self.weights_0_1[index]
            
            # Output layer:
            layer_2 = self.sigmoid(self.layer_1.dot(self.weights_1_2))

            # Output error: 
            layer_2_error = layer_2 - self.get_target_for_label(label)
            layer_2_delta = layer_2_error * self.sigmoid_output_2_derivative(layer_2)
            
            # Backprogagated error:
            layer_1_error = layer_2_delta.dot(self.weights_1_2.T) # errors propagated to the hidden layer
            layer_1_delta = layer_1_error # hidden layer gradients - no nonlinearity so it's the same as the error
            
            # Update weights:
            self.weights_1_2 -= self.layer_1.T.dot(layer_2_delta) * self.learning_rate # update hidden-to-output weights with gradient descent step (MINUS SIGN)
            
            for index in review:                        
                self.weights_0_1[index] -= layer_1_delta[0] * self.learning_rate # update input-to-hidden weights with gradient descent step (MINUS SIGN)
            
            # Keep track of correct predictions
            if (layer_2 >= 0.5) & (label == "POSITIVE"):
                correct_so_far += 1
            elif (layer_2 < 0.5) & (label == "NEGATIVE"):
                correct_so_far += 1
            
            # For debug purposes, print out our prediction accuracy and speed 
            # throughout the training process. 

            elapsed_time = float(time.time() - start)
            reviews_per_second = i / elapsed_time if elapsed_time > 0 else 0
            
            if (i % 1000 == 0):
                sys.stdout.write("\rProgress:" + str(100 * i/float(len(training_reviews)))[:4] \
                                 + "% Speed(reviews/sec):" + str(reviews_per_second)[0:5] \
                                 + " #Correct:" + str(correct_so_far) + " #Trained:" + str(i+1) \
                                 + " Training Accuracy:" + str(correct_so_far * 100 / float(i+1))[:4] + "%")
                print("")
    
    
    def test(self, testing_reviews, testing_labels):
        """
        Attempts to predict the labels for the given testing_reviews,
        and uses the test_labels to calculate the accuracy of those predictions.
        """
        
        # keep track of how many correct predictions we make
        correct = 0

        # we'll time how many predictions per second we make
        start = time.time()

        # Loop through each of the given reviews and call run to predict
        # its label. 
        for i in range(len(testing_reviews)):
            pred = self.run(testing_reviews[i])
            if(pred == testing_labels[i]):
                correct += 1
            
            # For debug purposes, print out our prediction accuracy and speed 
            # throughout the prediction process. 

            elapsed_time = float(time.time() - start)
            reviews_per_second = i / elapsed_time if elapsed_time > 0 else 0
            
            sys.stdout.write("\rProgress:" + str(100 * i/float(len(testing_reviews)))[:4] \
                             + "% Speed(reviews/sec):" + str(reviews_per_second)[0:5] \
                             + " #Correct:" + str(correct) + " #Tested:" + str(i+1) \
                             + " Testing Accuracy:" + str(correct * 100 / float(i+1))[:4] + "%")
    
    
    def run(self, review):
        """
        Returns a POSITIVE or NEGATIVE prediction for the given review.
        """
        
        # Hidden layer 
        self.layer_1 *= 0
        unique_indices = set()
        for word in review.lower().split(" "):
            if word in self.word2index.keys():
                unique_indices.add(self.word2index[word])
        for index in unique_indices:
            self.layer_1 += self.weights_0_1[index]                            
            
        # Output layer:
        layer_2 = self.sigmoid(self.layer_1.dot(self.weights_1_2))
        
        if(layer_2[0] >= 0.5):
            return "POSITIVE"
        else:
            return "NEGATIVE"


Run the following cell to recreate the network and train it once again.

In [137]:
mlp = SentimentNetwork(reviews[:-1000],labels[:-1000], learning_rate=0.1)

In [138]:
mlp.train(reviews[:-1000],labels[:-1000])

Progress:0.0% Speed(reviews/sec):0.0 #Correct:1 #Trained:1 Training Accuracy:100.%
Progress:4.16% Speed(reviews/sec):1327. #Correct:745 #Trained:1001 Training Accuracy:74.4%
Progress:8.33% Speed(reviews/sec):1346. #Correct:1541 #Trained:2001 Training Accuracy:77.0%
Progress:12.5% Speed(reviews/sec):1306. #Correct:2376 #Trained:3001 Training Accuracy:79.1%
Progress:16.6% Speed(reviews/sec):1312. #Correct:3183 #Trained:4001 Training Accuracy:79.5%
Progress:20.8% Speed(reviews/sec):1312. #Correct:3994 #Trained:5001 Training Accuracy:79.8%
Progress:25.0% Speed(reviews/sec):1323. #Correct:4832 #Trained:6001 Training Accuracy:80.5%
Progress:29.1% Speed(reviews/sec):1325. #Correct:5700 #Trained:7001 Training Accuracy:81.4%
Progress:33.3% Speed(reviews/sec):1336. #Correct:6552 #Trained:8001 Training Accuracy:81.8%
Progress:37.5% Speed(reviews/sec):1338. #Correct:7408 #Trained:9001 Training Accuracy:82.3%
Progress:41.6% Speed(reviews/sec):1346. #Correct:8278 #Trained:10001 Training Accuracy:82.

In [139]:
mlp.test(reviews[-1000:],labels[-1000:])

Progress:0.0% Speed(reviews/sec):0.0 #Correct:1 #Tested:1 Testing Accuracy:100.%Progress:0.1% Speed(reviews/sec):772.2 #Correct:1 #Tested:2 Testing Accuracy:50.0%Progress:0.2% Speed(reviews/sec):670.4 #Correct:2 #Tested:3 Testing Accuracy:66.6%Progress:0.3% Speed(reviews/sec):841.1 #Correct:3 #Tested:4 Testing Accuracy:75.0%Progress:0.4% Speed(reviews/sec):1047. #Correct:4 #Tested:5 Testing Accuracy:80.0%Progress:0.5% Speed(reviews/sec):1048. #Correct:5 #Tested:6 Testing Accuracy:83.3%Progress:0.6% Speed(reviews/sec):1044. #Correct:6 #Tested:7 Testing Accuracy:85.7%Progress:0.7% Speed(reviews/sec):1125. #Correct:7 #Tested:8 Testing Accuracy:87.5%Progress:0.8% Speed(reviews/sec):1211. #Correct:8 #Tested:9 Testing Accuracy:88.8%Progress:0.9% Speed(reviews/sec):1134. #Correct:9 #Tested:10 Testing Accuracy:90.0%Progress:1.0% Speed(reviews/sec):1192. #Correct:10 #Tested:11 Testing Accuracy:90.9%Progress:1.1% Speed(reviews/sec):1264. #Correct:11 #Tested:12 Testing Accuracy:91.6%

Progress:30.3% Speed(reviews/sec):1480. #Correct:270 #Tested:304 Testing Accuracy:88.8%Progress:30.4% Speed(reviews/sec):1483. #Correct:271 #Tested:305 Testing Accuracy:88.8%Progress:30.5% Speed(reviews/sec):1486. #Correct:271 #Tested:306 Testing Accuracy:88.5%Progress:30.6% Speed(reviews/sec):1473. #Correct:272 #Tested:307 Testing Accuracy:88.5%Progress:30.7% Speed(reviews/sec):1468. #Correct:272 #Tested:308 Testing Accuracy:88.3%Progress:30.8% Speed(reviews/sec):1458. #Correct:273 #Tested:309 Testing Accuracy:88.3%Progress:30.9% Speed(reviews/sec):1444. #Correct:273 #Tested:310 Testing Accuracy:88.0%Progress:31.0% Speed(reviews/sec):1445. #Correct:274 #Tested:311 Testing Accuracy:88.1%Progress:31.1% Speed(reviews/sec):1443. #Correct:275 #Tested:312 Testing Accuracy:88.1%Progress:31.2% Speed(reviews/sec):1444. #Correct:276 #Tested:313 Testing Accuracy:88.1%Progress:31.3% Speed(reviews/sec):1432. #Correct:277 #Tested:314 Testing Accuracy:88.2%Progress:31.4% Speed(reviews/se

Progress:52.4% Speed(reviews/sec):1282. #Correct:465 #Tested:525 Testing Accuracy:88.5%Progress:52.5% Speed(reviews/sec):1283. #Correct:466 #Tested:526 Testing Accuracy:88.5%Progress:52.6% Speed(reviews/sec):1283. #Correct:467 #Tested:527 Testing Accuracy:88.6%Progress:52.7% Speed(reviews/sec):1282. #Correct:468 #Tested:528 Testing Accuracy:88.6%Progress:52.8% Speed(reviews/sec):1280. #Correct:469 #Tested:529 Testing Accuracy:88.6%Progress:52.9% Speed(reviews/sec):1281. #Correct:470 #Tested:530 Testing Accuracy:88.6%Progress:53.0% Speed(reviews/sec):1279. #Correct:471 #Tested:531 Testing Accuracy:88.7%Progress:53.1% Speed(reviews/sec):1280. #Correct:471 #Tested:532 Testing Accuracy:88.5%Progress:53.2% Speed(reviews/sec):1281. #Correct:472 #Tested:533 Testing Accuracy:88.5%Progress:53.3% Speed(reviews/sec):1282. #Correct:473 #Tested:534 Testing Accuracy:88.5%Progress:53.4% Speed(reviews/sec):1284. #Correct:474 #Tested:535 Testing Accuracy:88.5%Progress:53.5% Speed(reviews/se

Progress:82.0% Speed(reviews/sec):1282. #Correct:708 #Tested:821 Testing Accuracy:86.2%Progress:82.1% Speed(reviews/sec):1281. #Correct:708 #Tested:822 Testing Accuracy:86.1%Progress:82.2% Speed(reviews/sec):1281. #Correct:708 #Tested:823 Testing Accuracy:86.0%Progress:82.3% Speed(reviews/sec):1281. #Correct:709 #Tested:824 Testing Accuracy:86.0%Progress:82.4% Speed(reviews/sec):1281. #Correct:710 #Tested:825 Testing Accuracy:86.0%Progress:82.5% Speed(reviews/sec):1280. #Correct:710 #Tested:826 Testing Accuracy:85.9%Progress:82.6% Speed(reviews/sec):1272. #Correct:711 #Tested:827 Testing Accuracy:85.9%Progress:82.7% Speed(reviews/sec):1273. #Correct:712 #Tested:828 Testing Accuracy:85.9%Progress:82.8% Speed(reviews/sec):1273. #Correct:713 #Tested:829 Testing Accuracy:86.0%Progress:82.9% Speed(reviews/sec):1273. #Correct:714 #Tested:830 Testing Accuracy:86.0%Progress:83.0% Speed(reviews/sec):1273. #Correct:715 #Tested:831 Testing Accuracy:86.0%Progress:83.1% Speed(reviews/se

## Further Noise Reduction

In [140]:
# words most frequently seen in a review with a "POSITIVE" label
pos_neg_ratios.most_common(10)

[('edie', 4.6913478822291435),
 ('paulie', 4.07753744390572),
 ('felix', 3.152736022363656),
 ('polanski', 2.8233610476132043),
 ('matthau', 2.80672172860924),
 ('victoria', 2.681021528714291),
 ('mildred', 2.6026896854443837),
 ('gandhi', 2.538973871058276),
 ('flawless', 2.451005098112319),
 ('superbly', 2.26002547857525)]

In [141]:
# words most frequently seen in a review with a "NEGATIVE" label
list(reversed(pos_neg_ratios.most_common()))[0:10]

[('boll', -4.969813299576001),
 ('uwe', -4.624972813284271),
 ('seagal', -3.644143560272545),
 ('unwatchable', -3.258096538021482),
 ('stinker', -3.2088254890146994),
 ('mst', -2.9502698994772336),
 ('incoherent', -2.9368917735310576),
 ('unfunny', -2.6922395950755678),
 ('waste', -2.6193845640165536),
 ('blah', -2.5704288232261625)]

## Reducing Noise by Strategically Reducing the Vocabulary

* Modify `pre_process_data`:
>* Add two additional parameters: `min_count` and `polarity_cutoff`
>* Calculate the positive-to-negative ratios of words used in the reviews. (You can use code you've written elsewhere in the notebook, but we are moving it into the class like we did with other helper code earlier.)
>* Change so words are only added to the vocabulary if they occur in the vocabulary more than `min_count` times.
>* Change so words are only added to the vocabulary if the absolute value of their postive-to-negative ratio is at least `polarity_cutoff`
* Modify `__init__`:
>* Add the same two parameters (`min_count` and `polarity_cutoff`) and use them when you call `pre_process_data`

In [142]:
from bokeh.models import ColumnDataSource, LabelSet
from bokeh.plotting import figure, show, output_file
from bokeh.io import output_notebook
output_notebook()

In [143]:
hist, edges = np.histogram(list(map(lambda x:x[1],pos_neg_ratios.most_common())), density=True, bins=100, normed=True)

p = figure(tools="pan,wheel_zoom,reset,save",
           toolbar_location="above",
           title="Word Positive/Negative Affinity Distribution")
p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:], line_color="#555555")
show(p)

  """Entry point for launching an IPython kernel.


In [144]:
frequency_frequency = Counter()

for word, cnt in total_counts.most_common():
    frequency_frequency[cnt] += 1

In [145]:
hist, edges = np.histogram(list(map(lambda x:x[1],frequency_frequency.most_common())), density=True, bins=100, normed=True)

p = figure(tools="pan,wheel_zoom,reset,save",
           toolbar_location="above",
           title="The frequency distribution of the words in our corpus")
p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:], line_color="#555555")
show(p)

  """Entry point for launching an IPython kernel.


In [146]:
import time
import sys
import numpy as np

# Encapsulate our neural network in a class
class SentimentNetwork:
    def __init__(self, reviews, labels, min_count = 10, polarity_cutoff = 0.1, hidden_nodes = 10, learning_rate = 0.1):
        """Create a SentimenNetwork with the given settings
        Args:
            reviews(list) - List of reviews used for training
            labels(list) - List of POSITIVE/NEGATIVE labels associated with the given reviews
            hidden_nodes(int) - Number of nodes to create in the hidden layer
            learning_rate(float) - Learning rate to use while training
        
        """
        # Assign a seed to our random number generator to ensure we get
        # reproducable results during development 
        np.random.seed(1)

        # process the reviews and their associated labels so that everything
        # is ready for training
        self.pre_process_data(reviews, labels, polarity_cutoff, min_count)
        
        # Build the network to have the number of hidden nodes and the learning rate that
        # were passed into this initializer. Make the same number of input nodes as
        # there are vocabulary words and create a single output node.
        self.init_network(len(self.review_vocab),hidden_nodes, 1, learning_rate)

    def pre_process_data(self, reviews, labels, min_count, polarity_cutoff):
        
        positive_counts = Counter()
        negative_counts = Counter()
        total_counts = Counter()

        for i in range(len(reviews)):
            if(labels[i] == 'POSITIVE'):
                for word in reviews[i].split(" "):
                    positive_counts[word] += 1
                    total_counts[word] += 1
            else:
                for word in reviews[i].split(" "):
                    negative_counts[word] += 1
                    total_counts[word] += 1

        pos_neg_ratios = Counter()

        for term,cnt in list(total_counts.most_common()):
            if(cnt >= 50):
                pos_neg_ratio = positive_counts[term] / float(negative_counts[term]+1)
                pos_neg_ratios[term] = pos_neg_ratio

        for word,ratio in pos_neg_ratios.most_common():
            if(ratio > 1):
                pos_neg_ratios[word] = np.log(ratio)
            else:
                pos_neg_ratios[word] = -np.log((1 / (ratio + 0.01)))

        # populate review_vocab with all of the words in the given reviews
        review_vocab = set()
        for review in reviews:
            for word in review.split(" "):
                ## Only add words that occur at least min_count times
                #  and for words with pos/neg ratios, only add words
                #  that meet the polarity_cutoff
                if(total_counts[word] > min_count):
                    if(word in pos_neg_ratios.keys()):
                        if((pos_neg_ratios[word] >= polarity_cutoff) or (pos_neg_ratios[word] <= -polarity_cutoff)):
                            review_vocab.add(word)
                    else:
                        review_vocab.add(word)

        # Convert the vocabulary set to a list so we can access words via indices
        self.review_vocab = review_vocab
        
        # populate label_vocab with all of the words in the given labels.
        label_vocab = set()
        for label in labels:
            label_vocab.add(label)
        
        # Convert the label vocabulary set to a list so we can access labels via indices
        self.label_vocab = label_vocab
        
        # Store the sizes of the review and label vocabularies.
        self.review_vocab_size = len(self.review_vocab)
        self.label_vocab_size = len(self.label_vocab)
        
        # Create a dictionary of words in the vocabulary mapped to index positions
        self.word2index = {}
        for i,word in enumerate(self.review_vocab):
            self.word2index[word] = i
        
        # Create a dictionary of labels mapped to index positions
        self.label2index = {}
        for i,label in enumerate(self.label_vocab):
            self.label2index[label] = i
        
        
    def init_network(self, input_nodes, hidden_nodes, output_nodes, learning_rate):
        # Store the number of nodes in input, hidden, and output layers.
        self.input_nodes = input_nodes
        self.hidden_nodes = hidden_nodes
        self.output_nodes = output_nodes

        # Store the learning rate
        self.learning_rate = learning_rate

        # Initialize weights
        self.weights_0_1 = np.zeros((self.input_nodes,self.hidden_nodes))
        self.weights_1_2 = np.random.normal(0.0,self.hidden_nodes**-0.5,(self.hidden_nodes,self.output_nodes))
        self.layer_1 = np.zeros((1,hidden_nodes))            
                
    def get_target_for_label(self,label):
        """Convert a label to `0` or `1`.
        Args:
            label(string) - Either "POSITIVE" or "NEGATIVE".
        Returns:
            `0` or `1`.
        """
        
        if(label == "POSITIVE"):
            return 1
        else:
            return 0
        
        
    def sigmoid(self,x):
        return 1 / (1 + np.exp(-x))
    
    
    def sigmoid_output_2_derivative(self,output):
        return output * (1 - output)

    
    def train(self, training_reviews_raw, training_labels):
        training_reviews = list()
        for review in training_reviews_raw:
            indices = set()
            for word in review.split(" "):
                if(word in self.word2index.keys()):
                    indices.add(self.word2index[word])
            training_reviews.append(list(indices))

        # make sure out we have a matching number of reviews and labels
        assert(len(training_reviews_raw) == len(training_labels))
        
        # Keep track of correct predictions to display accuracy during training 
        correct_so_far = 0
        
        # Remember when we started for printing time statistics
        start = time.time()

        # loop through all the given reviews and run a forward and backward pass,
        # updating weights for every item
        for i in range(len(training_reviews)):
            
            # Get the next review and its correct label
            review = training_reviews[i]
            label = training_labels[i]
            
            # Hidden layer
            self.layer_1 *= 0
            for index in review:
                self.layer_1 += self.weights_0_1[index]
            
            # Output layer:
            layer_2 = self.sigmoid(self.layer_1.dot(self.weights_1_2))
            
            # Output error: 
            layer_2_error = layer_2 - self.get_target_for_label(label)
            layer_2_delta = layer_2_error * self.sigmoid_output_2_derivative(layer_2)
            
            # Backprogagated error:
            layer_1_error = layer_2_delta.dot(self.weights_1_2.T) # errors propagated to the hidden layer
            layer_1_delta = layer_1_error # hidden layer gradients - no nonlinearity so it's the same as the error
            
            # Update weights:
            self.weights_1_2 -= self.layer_1.T.dot(layer_2_delta) * self.learning_rate # update hidden-to-output weights with gradient descent step (MINUS SIGN)
            
            for index in review:                        
                self.weights_0_1[index] -= layer_1_delta[0] * self.learning_rate # update input-to-hidden weights with gradient descent step (MINUS SIGN)
            
            # Keep track of correct predictions
            
            if (layer_2 >= 0.5) & (label == "POSITIVE"):
                correct_so_far += 1
            elif (layer_2 < 0.5) & (label == "NEGATIVE"):
                correct_so_far += 1
            
            # For debug purposes, print out our prediction accuracy and speed 
            # throughout the training process. 

            elapsed_time = float(time.time() - start)
            reviews_per_second = i / elapsed_time if elapsed_time > 0 else 0
            
            if (i % 1000 == 0):
                sys.stdout.write("\rProgress:" + str(100 * i/float(len(training_reviews)))[:4] \
                                 + "% Speed(reviews/sec):" + str(reviews_per_second)[0:5] \
                                 + " #Correct:" + str(correct_so_far) + " #Trained:" + str(i+1) \
                                 + " Training Accuracy:" + str(correct_so_far * 100 / float(i+1))[:4] + "%")
                print("")
    
    def test(self, testing_reviews, testing_labels):
        """
        Attempts to predict the labels for the given testing_reviews,
        and uses the test_labels to calculate the accuracy of those predictions.
        """
        
        # keep track of how many correct predictions we make
        correct = 0

        # we'll time how many predictions per second we make
        start = time.time()

        # Loop through each of the given reviews and call run to predict
        # its label. 
        for i in range(len(testing_reviews)):
            pred = self.run(testing_reviews[i])
            if(pred == testing_labels[i]):
                correct += 1
            
            # For debug purposes, print out our prediction accuracy and speed 
            # throughout the prediction process. 

            elapsed_time = float(time.time() - start)
            reviews_per_second = i / elapsed_time if elapsed_time > 0 else 0
            
            sys.stdout.write("\rProgress:" + str(100 * i/float(len(testing_reviews)))[:4] \
                             + "% Speed(reviews/sec):" + str(reviews_per_second)[0:5] \
                             + " #Correct:" + str(correct) + " #Tested:" + str(i+1) \
                             + " Testing Accuracy:" + str(correct * 100 / float(i+1))[:4] + "%")

    
    def run(self, review):
        """
        Returns a POSITIVE or NEGATIVE prediction for the given review.
        """
        self.layer_1 *= 0
        unique_indices = set()
        for word in review.lower().split(" "):
            if word in self.word2index.keys():
                unique_indices.add(self.word2index[word])
        for index in unique_indices:
            self.layer_1 += self.weights_0_1[index]                            
            
        # Output layer:
        layer_2 = self.sigmoid(self.layer_1.dot(self.weights_1_2))
        
        if(layer_2[0] >= 0.5):
            return "POSITIVE"
        else:
            return "NEGATIVE"

Run the following cell to train your network with a small polarity cutoff.

In [147]:
mlp = SentimentNetwork(reviews[:-1000],labels[:-1000],min_count=20,polarity_cutoff=0.05,learning_rate=0.01)

In [148]:
mlp.train(reviews[:-1000],labels[:-1000])

Progress:0.0% Speed(reviews/sec):0.0 #Correct:1 #Trained:1 Training Accuracy:100.%
Progress:4.16% Speed(reviews/sec):9260. #Correct:748 #Trained:1001 Training Accuracy:74.7%
Progress:8.33% Speed(reviews/sec):9976. #Correct:1454 #Trained:2001 Training Accuracy:72.6%
Progress:12.5% Speed(reviews/sec):9522. #Correct:2189 #Trained:3001 Training Accuracy:72.9%
Progress:16.6% Speed(reviews/sec):9554. #Correct:2896 #Trained:4001 Training Accuracy:72.3%
Progress:20.8% Speed(reviews/sec):9071. #Correct:3605 #Trained:5001 Training Accuracy:72.0%
Progress:25.0% Speed(reviews/sec):8664. #Correct:4302 #Trained:6001 Training Accuracy:71.6%
Progress:29.1% Speed(reviews/sec):8585. #Correct:5032 #Trained:7001 Training Accuracy:71.8%
Progress:33.3% Speed(reviews/sec):8700. #Correct:5753 #Trained:8001 Training Accuracy:71.9%
Progress:37.5% Speed(reviews/sec):8847. #Correct:6486 #Trained:9001 Training Accuracy:72.0%
Progress:41.6% Speed(reviews/sec):8924. #Correct:7194 #Trained:10001 Training Accuracy:71.

And run the following cell to test it's performance. It should be 

In [149]:
mlp.test(reviews[-1000:],labels[-1000:])

Progress:0.0% Speed(reviews/sec):0.0 #Correct:1 #Tested:1 Testing Accuracy:100.%Progress:0.1% Speed(reviews/sec):1919. #Correct:2 #Tested:2 Testing Accuracy:100.%Progress:0.2% Speed(reviews/sec):3225. #Correct:3 #Tested:3 Testing Accuracy:100.%Progress:0.3% Speed(reviews/sec):4161. #Correct:4 #Tested:4 Testing Accuracy:100.%Progress:0.4% Speed(reviews/sec):5133. #Correct:4 #Tested:5 Testing Accuracy:80.0%Progress:0.5% Speed(reviews/sec):5726. #Correct:5 #Tested:6 Testing Accuracy:83.3%Progress:0.6% Speed(reviews/sec):5837. #Correct:6 #Tested:7 Testing Accuracy:85.7%Progress:0.7% Speed(reviews/sec):6064. #Correct:7 #Tested:8 Testing Accuracy:87.5%Progress:0.8% Speed(reviews/sec):6294. #Correct:7 #Tested:9 Testing Accuracy:77.7%Progress:0.9% Speed(reviews/sec):4225. #Correct:8 #Tested:10 Testing Accuracy:80.0%Progress:1.0% Speed(reviews/sec):3957. #Correct:8 #Tested:11 Testing Accuracy:72.7%Progress:1.1% Speed(reviews/sec):4095. #Correct:9 #Tested:12 Testing Accuracy:75.0%Pr

Progress:85.3% Speed(reviews/sec):3583. #Correct:549 #Tested:854 Testing Accuracy:64.2%Progress:85.4% Speed(reviews/sec):3585. #Correct:549 #Tested:855 Testing Accuracy:64.2%Progress:85.5% Speed(reviews/sec):3587. #Correct:549 #Tested:856 Testing Accuracy:64.1%Progress:85.6% Speed(reviews/sec):3588. #Correct:549 #Tested:857 Testing Accuracy:64.0%Progress:85.7% Speed(reviews/sec):3589. #Correct:549 #Tested:858 Testing Accuracy:63.9%Progress:85.8% Speed(reviews/sec):3589. #Correct:549 #Tested:859 Testing Accuracy:63.9%Progress:85.9% Speed(reviews/sec):3590. #Correct:550 #Tested:860 Testing Accuracy:63.9%Progress:86.0% Speed(reviews/sec):3591. #Correct:550 #Tested:861 Testing Accuracy:63.8%Progress:86.1% Speed(reviews/sec):3592. #Correct:551 #Tested:862 Testing Accuracy:63.9%Progress:86.2% Speed(reviews/sec):3594. #Correct:551 #Tested:863 Testing Accuracy:63.8%Progress:86.3% Speed(reviews/sec):3591. #Correct:551 #Tested:864 Testing Accuracy:63.7%Progress:86.4% Speed(reviews/se

Run the following cell to train your network with a much larger polarity cutoff.

In [150]:
mlp = SentimentNetwork(reviews[:-1000],labels[:-1000],min_count=20,polarity_cutoff=0.8,learning_rate=0.01)

And run the following cell to test it's performance.

In [151]:
mlp.train(reviews[:-1000],labels[:-1000])

Progress:0.0% Speed(reviews/sec):0.0 #Correct:1 #Trained:1 Training Accuracy:100.%
Progress:4.16% Speed(reviews/sec):7328. #Correct:748 #Trained:1001 Training Accuracy:74.7%
Progress:8.33% Speed(reviews/sec):8598. #Correct:1454 #Trained:2001 Training Accuracy:72.6%
Progress:12.5% Speed(reviews/sec):7986. #Correct:2189 #Trained:3001 Training Accuracy:72.9%
Progress:16.6% Speed(reviews/sec):7519. #Correct:2896 #Trained:4001 Training Accuracy:72.3%
Progress:20.8% Speed(reviews/sec):6779. #Correct:3605 #Trained:5001 Training Accuracy:72.0%
Progress:25.0% Speed(reviews/sec):6908. #Correct:4302 #Trained:6001 Training Accuracy:71.6%
Progress:29.1% Speed(reviews/sec):6698. #Correct:5032 #Trained:7001 Training Accuracy:71.8%
Progress:33.3% Speed(reviews/sec):6942. #Correct:5753 #Trained:8001 Training Accuracy:71.9%
Progress:37.5% Speed(reviews/sec):7170. #Correct:6486 #Trained:9001 Training Accuracy:72.0%
Progress:41.6% Speed(reviews/sec):7046. #Correct:7194 #Trained:10001 Training Accuracy:71.

In [152]:
mlp.test(reviews[-1000:],labels[-1000:])

Progress:0.0% Speed(reviews/sec):0.0 #Correct:1 #Tested:1 Testing Accuracy:100.%Progress:0.1% Speed(reviews/sec):1464. #Correct:2 #Tested:2 Testing Accuracy:100.%Progress:0.2% Speed(reviews/sec):2375. #Correct:3 #Tested:3 Testing Accuracy:100.%Progress:0.3% Speed(reviews/sec):2964. #Correct:4 #Tested:4 Testing Accuracy:100.%Progress:0.4% Speed(reviews/sec):3285. #Correct:4 #Tested:5 Testing Accuracy:80.0%Progress:0.5% Speed(reviews/sec):3272. #Correct:5 #Tested:6 Testing Accuracy:83.3%Progress:0.6% Speed(reviews/sec):2571. #Correct:6 #Tested:7 Testing Accuracy:85.7%Progress:0.7% Speed(reviews/sec):2421. #Correct:7 #Tested:8 Testing Accuracy:87.5%Progress:0.8% Speed(reviews/sec):2515. #Correct:7 #Tested:9 Testing Accuracy:77.7%Progress:0.9% Speed(reviews/sec):2055. #Correct:8 #Tested:10 Testing Accuracy:80.0%Progress:1.0% Speed(reviews/sec):1890. #Correct:8 #Tested:11 Testing Accuracy:72.7%Progress:1.1% Speed(reviews/sec):1973. #Correct:9 #Tested:12 Testing Accuracy:75.0%Pr

Progress:84.2% Speed(reviews/sec):3994. #Correct:543 #Tested:843 Testing Accuracy:64.4%Progress:84.3% Speed(reviews/sec):3992. #Correct:544 #Tested:844 Testing Accuracy:64.4%Progress:84.4% Speed(reviews/sec):3994. #Correct:545 #Tested:845 Testing Accuracy:64.4%Progress:84.5% Speed(reviews/sec):3996. #Correct:545 #Tested:846 Testing Accuracy:64.4%Progress:84.6% Speed(reviews/sec):3996. #Correct:546 #Tested:847 Testing Accuracy:64.4%Progress:84.7% Speed(reviews/sec):3996. #Correct:546 #Tested:848 Testing Accuracy:64.3%Progress:84.8% Speed(reviews/sec):3997. #Correct:546 #Tested:849 Testing Accuracy:64.3%Progress:84.9% Speed(reviews/sec):3992. #Correct:547 #Tested:850 Testing Accuracy:64.3%Progress:85.0% Speed(reviews/sec):3968. #Correct:548 #Tested:851 Testing Accuracy:64.3%Progress:85.1% Speed(reviews/sec):3967. #Correct:548 #Tested:852 Testing Accuracy:64.3%Progress:85.2% Speed(reviews/sec):3960. #Correct:548 #Tested:853 Testing Accuracy:64.2%Progress:85.3% Speed(reviews/se

## So what's Going on in the Weights?

In [153]:
mlp_full = SentimentNetwork(reviews[:-1000],labels[:-1000],min_count=0,polarity_cutoff=0,learning_rate=0.01)

In [154]:
mlp_full.train(reviews[:-1000],labels[:-1000])

Progress:0.0% Speed(reviews/sec):0.0 #Correct:1 #Trained:1 Training Accuracy:100.%
Progress:4.16% Speed(reviews/sec):1318. #Correct:741 #Trained:1001 Training Accuracy:74.0%
Progress:8.33% Speed(reviews/sec):1384. #Correct:1530 #Trained:2001 Training Accuracy:76.4%
Progress:12.5% Speed(reviews/sec):1361. #Correct:2377 #Trained:3001 Training Accuracy:79.2%
Progress:16.6% Speed(reviews/sec):1355. #Correct:3188 #Trained:4001 Training Accuracy:79.6%
Progress:20.8% Speed(reviews/sec):1340. #Correct:4003 #Trained:5001 Training Accuracy:80.0%
Progress:25.0% Speed(reviews/sec):1345. #Correct:4829 #Trained:6001 Training Accuracy:80.4%
Progress:29.1% Speed(reviews/sec):1342. #Correct:5690 #Trained:7001 Training Accuracy:81.2%
Progress:33.3% Speed(reviews/sec):1344. #Correct:6548 #Trained:8001 Training Accuracy:81.8%
Progress:37.5% Speed(reviews/sec):1343. #Correct:7404 #Trained:9001 Training Accuracy:82.2%
Progress:41.6% Speed(reviews/sec):1347. #Correct:8273 #Trained:10001 Training Accuracy:82.

In [155]:
def get_most_similar_words(focus = "horrible"):
    most_similar = Counter()

    for word in mlp_full.word2index.keys():
        most_similar[word] = np.dot(mlp_full.weights_0_1[mlp_full.word2index[word]],mlp_full.weights_0_1[mlp_full.word2index[focus]])
    
    return most_similar.most_common()

In [156]:
get_most_similar_words("excellent")[:10]

[('excellent', 0.14660809065567862),
 ('perfect', 0.12526627358447937),
 ('great', 0.1071386892352669),
 ('amazing', 0.10165619345416502),
 ('wonderful', 0.09706521876360932),
 ('best', 0.09633584659210816),
 ('today', 0.09062129241360575),
 ('fun', 0.08856844988594424),
 ('loved', 0.07910638697206201),
 ('definitely', 0.07691091698161268)]

In [157]:
get_most_similar_words("terrible")[:10]

[('worst', 0.17615579852418917),
 ('awful', 0.12575196125185908),
 ('waste', 0.11991666736024725),
 ('poor', 0.10184592068658213),
 ('boring', 0.09740082138736945),
 ('terrible', 0.09719201719001999),
 ('bad', 0.08194195157424214),
 ('dull', 0.0812879868188303),
 ('worse', 0.07505502158320543),
 ('poorly', 0.07495944377181996)]

In [158]:
import matplotlib.colors as colors

words_to_visualize = list()
for word, ratio in pos_neg_ratios.most_common(500):
    if(word in mlp_full.word2index.keys()):
        words_to_visualize.append(word)
    
for word, ratio in list(reversed(pos_neg_ratios.most_common()))[0:500]:
    if(word in mlp_full.word2index.keys()):
        words_to_visualize.append(word)

In [159]:
pos = 0
neg = 0

colors_list = list()
vectors_list = list()
for word in words_to_visualize:
    if word in pos_neg_ratios.keys():
        vectors_list.append(mlp_full.weights_0_1[mlp_full.word2index[word]])
        if(pos_neg_ratios[word] > 0):
            pos+=1
            colors_list.append("#00ff00")
        else:
            neg+=1
            colors_list.append("#000000")

In [160]:
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, random_state=0)
words_top_ted_tsne = tsne.fit_transform(vectors_list)

In [161]:
p = figure(tools="pan,wheel_zoom,reset,save",
           toolbar_location="above",
           title="vector T-SNE for most polarized words")

source = ColumnDataSource(data=dict(x1=words_top_ted_tsne[:,0],
                                    x2=words_top_ted_tsne[:,1],
                                    names=words_to_visualize,
                                    color=colors_list))

p.scatter(x="x1", y="x2", size=8, source=source, fill_color="color")

word_labels = LabelSet(x="x1", y="x2", text="names", y_offset=6,
                  text_font_size="8pt", text_color="#555555",
                  source=source, text_align='center')
p.add_layout(word_labels)

show(p)

# Green indicates positive words, black indicates negative words