# Logistic Regression for Sentiment Analysis

Adapted from http://nbviewer.jupyter.org/github/rasbt/pattern_classification/blob/master/machine_learning/scikit-learn/outofcore_modelpersistence.ipynb

## The IMDb Movie Review Dataset

In this section, we will train a simple logistic regression model to classify movie reviews from the 50k IMDb review dataset that has been collected by Maas et. al.

> AL Maas, RE Daly, PT Pham, D Huang, AY Ng, and C Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Lin- guistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics

[Source: http://ai.stanford.edu/~amaas/data/sentiment/]

The dataset consists of 50,000 movie reviews from the original "train" and "test" subdirectories. The class labels are binary (1=positive and 0=negative) and contain 25,000 positive and 25,000 negative movie reviews, respectively.
For simplicity, I assembled the reviews in a single CSV file.


In [4]:
import pandas as pd
# if you want to download the original file:
#df = pd.read_csv('https://raw.githubusercontent.com/rasbt/pattern_classification/master/data/50k_imdb_movie_reviews.csv')
# otherwise load local file
df = pd.read_csv('shuffled_movie_data.csv')
df.tail()

Unnamed: 0,review,sentiment
49995,"OK, lets start with the best. the building. al...",0
49996,The British 'heritage film' industry is out of...,0
49997,I don't even know where to begin on this one. ...,0
49998,Richard Tyler is a little boy who is scared of...,0
49999,I waited long to watch this movie. Also becaus...,1


Let us shuffle the class labels.

In [5]:
import numpy as np
## uncomment these lines if you have dowloaded the original file:
#np.random.seed(0)
#df = df.reindex(np.random.permutation(df.index))
#df[['review', 'sentiment']].to_csv('shuffled_movie_data.csv', index=False)

## Preprocessing Text Data

Now, let us define a simple `tokenizer` that splits the text into individual word tokens. Furthermore, we will use some simple regular expression to remove HTML markup and all non-letter characters but "emoticons," convert the text to lower case, remove stopwords, and apply the Porter stemming algorithm to convert the words into their root form.

In [6]:
import numpy as np
from nltk.stem.porter import PorterStemmer
import re
from nltk.corpus import stopwords

stop = stopwords.words('english')
porter = PorterStemmer()

def tokenizer(text):
    text = re.sub('<[^>]*>', '', text)
    emoticons = re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    text = re.sub('[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    text = [w for w in text.split() if w not in stop]
    tokenized = [porter.stem(w) for w in text]
    return text

Let's give it at try:

In [8]:
tokenizer('This :) is a <a> test Israel! :-)</br>')

['test', 'israel', ':)', ':)']

## Learning (SciKit)

First, we define a generator that returns the document body and the corresponding class label:

In [9]:
def stream_docs(path):
    with open(path, 'r') as csv:
        next(csv) # skip header
        for line in csv:
            text, label = line[:-3], int(line[-2])
            yield text, label

To conform that the `stream_docs` function fetches the documents as intended, let us execute the following code snippet before we implement the `get_minibatch` function:

In [10]:
next(stream_docs(path='shuffled_movie_data.csv'))

('"In 1974, the teenager Martha Moxley (Maggie Grace) moves to the high-class area of Belle Haven, Greenwich, Connecticut. On the Mischief Night, eve of Halloween, she was murdered in the backyard of her house and her murder remained unsolved. Twenty-two years later, the writer Mark Fuhrman (Christopher Meloni), who is a former LA detective that has fallen in disgrace for perjury in O.J. Simpson trial and moved to Idaho, decides to investigate the case with his partner Stephen Weeks (Andrew Mitchell) with the purpose of writing a book. The locals squirm and do not welcome them, but with the support of the retired detective Steve Carroll (Robert Forster) that was in charge of the investigation in the 70\'s, they discover the criminal and a net of power and money to cover the murder.<br /><br />""Murder in Greenwich"" is a good TV movie, with the true story of a murder of a fifteen years old girl that was committed by a wealthy teenager whose mother was a Kennedy. The powerful and rich f

After we confirmed that our `stream_docs` functions works, we will now implement a `get_minibatch` function to fetch a specified number (`size`) of documents:

In [11]:
def get_minibatch(doc_stream, size):
    docs, y = [], []
    for _ in range(size):
        text, label = next(doc_stream)
        docs.append(text)
        y.append(label)
    return docs, y

Next, we will make use of the "hashing trick" through scikit-learns [HashingVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html) to create a bag-of-words model of our documents. Details of the bag-of-words model for document classification can be found at  [Naive Bayes and Text Classification I - Introduction and Theory](http://arxiv.org/abs/1410.5329).

In [12]:
from sklearn.feature_extraction.text import HashingVectorizer
vect = HashingVectorizer(decode_error='ignore', 
                         n_features=2**21,
                         preprocessor=None, 
                         tokenizer=tokenizer)

# Exercise 1: define features based on word embeddings (pre-trained word2vec / Glove/Fastext emebddings can be used)
# Define suitable d dimension, and sequence length

In [14]:
#Primero creamos el corpus y lo almacenamos
from nltk.util import ngrams
from nltk import FreqDist

n = 50000
corpus_words = []
for i in range(n):
    corpus_words += [tokenizer(df.iloc[i]['review'])]
    if i %10000 == 0:
        print("Completed ",20*i/10000,"%")  
print(corpus_words)

Completed  0.0 %
Completed  20.0 %
Completed  40.0 %
Completed  60.0 %
Completed  80.0 %


IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



In [18]:
# Luego generamos el modelo usando word2vec
from gensim.models import Word2Vec

model = Word2Vec(corpus_words,size=100)
#print(model)
w2v = dict(zip(model.wv.index2word,model.wv.vectors))
print(w2v)

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



In [19]:
#Creamos un vector de medias de los embeddings
def MeanEmbeddingVectorizer(tokenized_review):
    mean_vector = []
    for word in tokenized_review:
        if word in w2v:
            mean_vector.append(w2v[word].tolist())
        else:
            mean_vector.append(np.zeros(100))
    mean_vector = np.mean(mean_vector,axis = 0)
    #print(mean_vector)
    return mean_vector

# Actualizamos la data con la pre-procesada
data = []
for i in range(50000):
    data.append(MeanEmbeddingVectorizer(corpus_words[i]))
data = np.array(data)
print(data.shape)

target = np.array([df['sentiment']])

target = target.T
print(target[:5],target.shape) # Verificamos correspondencia

(50000, 100)
[[1]
 [0]
 [0]
 [1]
 [0]] (50000, 1)


In [20]:
#Dividimos la data en entrenaiento y test
training_data = data[:45000]
training_target = target[:45000]

test_data = data[45000:]
test_target = target[45000:]

In [21]:
#Programamos la red neuronal
def sigmoid(x,derivative = False):
    if derivative:
        return sigmoid(x)*(1 - sigmoid(x))
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    """ Funcion para calcular la derivada del sigmoide cuando x es sig(x)"""
    return x * (1 - x)

def evalPrediction(x,threshold,target):
    """Funcion que evalua el acierto de una salida, Devuelve un booleano"""
    if x > threshold:
        return target == 1
    else:
        return target == 0
        
sig_PrimeVectorized = np.vectorize(sigmoid_prime)
sigVectorized = np.vectorize(sigmoid)
evalPredictionVectorized = np.vectorize(evalPrediction)

class Weights(object):
    def __init__(self, numFeatures,numNeurons):
        self.numFeatures =  numFeatures
        self.numNeurons = numNeurons
        self.weights = np.random.random((numFeatures,numNeurons))
    
    def printWeights(self):
        print(self.weights)

class NeuralNetwork(object):
    def __init__(self,neurons,data,target,_test_data,_test_target):
        # We include the bias at the start of data matrix
        self.training_data = data
        self.n = data.shape[0] #Numero de ejemplos en el training
        self. numFeatures = self.training_data.shape[1]
        self.training_target = target
        # Input Layer
        self.inLayer = None
        # First Weight Matrix(nro col = nro de neuronas)
        self.neuronsInHiddenLayer = neurons
        # acordemonos que un feature adicional seria el bias
        self.weights_1 = np.random.random((self.numFeatures+1,self.neuronsInHiddenLayer))
        # Hidden Layers
        self.hiddenLayer = None        
        # Second Weight Matrix, Agregar una columna para el bias
        self.weights_2 = np.random.random((self.neuronsInHiddenLayer + 1,1)) # Solo una neurona de salida
        # Output Layer
        self.outLayer = None
        self.Layers = None
        
        # Loading Test Data, (Agregamos una fila de 1 para el bias)
        _test_data = np.append( np.full((len(_test_data),1),1),_test_data,axis = 1)
        self.test_data = _test_data
        self.test_target = _test_target
        
        ## Labels para imprimir los arreglos
        self.enum = {
                0:"Input Data:\n",
                1:"Weights 1:\n",
                2:"Hidden Layer:\n",
                3:"Weights 2:\n",
                4:"Output Layer:\n",}
        
    def forwardPropagation(self,i):
        # Seleccionamos la cantidad de datos
        self.inLayer = np.array([self.training_data[i%self.n]])
        #self.inLayer = np.array(self.training_data)
        self.inLayer = np.append( np.full((len(self.inLayer),1),1),self.inLayer,axis = 1)
        
        self.hiddenLayer = np.dot(self.inLayer,self.weights_1)
        self.hiddenLayer = sigmoid(self.hiddenLayer)
        self.hiddenLayer = np.append( np.full((len(self.hiddenLayer),1),1),self.hiddenLayer,axis = 1)
        
        self.outLayer = np.dot(self.hiddenLayer,self.weights_2)
        self.outLayer = sigmoid(self.outLayer)
        
        self.updateLayers()
        
    def updateLayers(self):
        self.Layers = [self.inLayer,self.weights_1,self.hiddenLayer,self.weights_2,self.outLayer]
        
    def backPropagation(self,i,learning_rate):
        error = (self.outLayer[0][0] - self.training_target[i%self.n])
        weights_1 = self.weights_1
        weights_2 = self.weights_2
        weights_2 = weights_2 - learning_rate * error * self.hiddenLayer.T
        hidden = self.hiddenLayer.T 
        hidden = np.delete(hidden,(0),axis = 0)
        hidden = sig_PrimeVectorized(hidden).T
        derivative = error * np.dot(np.array([self.inLayer[0]]).T,hidden)
        w2 = self.weights_2
        w2 = np.delete(w2,(0),axis=0)
        w2 = np.tile(w2,self.neuronsInHiddenLayer)
        derivative = np.dot(derivative,w2)
        weights_1 = weights_1 - learning_rate * derivative
        self.weights_1 = weights_1
        self.weights_2 = weights_2
        self.updateLayers()

    def Train(self,iterations,learn_rate):
        alpha = learn_rate
        b= True
        for i in range(iterations):
            self.forwardPropagation(i)
            self.backPropagation(i,alpha)
            #self.printNN()
            if(i % 10000 == 0):
                print("Progress:",i / 10000,"%")
                if(self.getAccuracy() > 70 and b):
                    alpha /= 10
                    b = False
        self.printNN()
        
    def printNN(self):
        print("#################")
        i = 0
        for layer in self.Layers:
            print(self.enum[i],layer)
            i +=1
        print("#################")
    def getAccuracy(self):
        prediction = np.dot(self.test_data,self.weights_1)
        prediction = sigmoid(prediction)
        prediction = np.append( np.full((len(prediction),1),1),prediction,axis = 1)
        prediction = np.dot(prediction,self.weights_2)
        prediction = sigmoid(prediction)
        threshold = 0.5      
        xs = evalPredictionVectorized(prediction,threshold,self.test_target)
        numSuccesses = np.count_nonzero(xs == True)
        print("Acc:", numSuccesses / len(xs) * 100)
        return numSuccesses / len(xs) * 100

In [22]:
# Corremos la red neuronal en base a la data pre-procesada y calculamos el Acc.
NN = NeuralNetwork(32,training_data,training_target,test_data,test_target)
NN.Train(1000000,0.01)
NN.getAccuracy()

Progress: 0.0 %
Acc: 50.63999999999999
Progress: 1.0 %
Acc: 78.97999999999999
Progress: 2.0 %
Acc: 80.46
Progress: 3.0 %
Acc: 81.86
Progress: 4.0 %
Acc: 82.96
Progress: 5.0 %
Acc: 83.48
Progress: 6.0 %
Acc: 83.78
Progress: 7.0 %
Acc: 83.96000000000001
Progress: 8.0 %
Acc: 84.46000000000001
Progress: 9.0 %
Acc: 84.38
Progress: 10.0 %
Acc: 84.89999999999999
Progress: 11.0 %
Acc: 85.02
Progress: 12.0 %
Acc: 85.24000000000001
Progress: 13.0 %
Acc: 85.34
Progress: 14.0 %
Acc: 85.44
Progress: 15.0 %
Acc: 85.52
Progress: 16.0 %
Acc: 85.72
Progress: 17.0 %
Acc: 85.86
Progress: 18.0 %
Acc: 85.84
Progress: 19.0 %
Acc: 85.86
Progress: 20.0 %
Acc: 86.06
Progress: 21.0 %
Acc: 86.0
Progress: 22.0 %
Acc: 86.08
Progress: 23.0 %
Acc: 86.18
Progress: 24.0 %
Acc: 86.14
Progress: 25.0 %
Acc: 86.2
Progress: 26.0 %
Acc: 86.3
Progress: 27.0 %
Acc: 86.22
Progress: 28.0 %
Acc: 86.32
Progress: 29.0 %
Acc: 86.22
Progress: 30.0 %
Acc: 86.42
Progress: 31.0 %
Acc: 86.4
Progress: 32.0 %
Acc: 86.36
Progress: 33.0 %
A

86.68