# Applying Logistic Regression to Sentiment Analysis

Adapted from http://nbviewer.jupyter.org/github/rasbt/pattern_classification/blob/master/machine_learning/scikit-learn/outofcore_modelpersistence.ipynb

<br>
<br>

## The IMDb Movie Review Dataset

In this section we train a logistic regression model to classify reviews from the 50K IMDb movie review dataset collected by Maas et al.

> AL Maas, RE Daly, PT Pham, D Huang, AY Ng, and C Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics

[Source: http://ai.stanford.edu/~amaas/data/sentiment/]

The dataset consists of 50,000 movie reviews taken from the original "train" and "test" sets. The labels are binary: 25,000 reviews are positive and 25,000 are negative.

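The rows are shuffled so that the class labels are not ordered; here the already shuffled data is loaded from `shuffled_movie_data.csv`.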
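Shuffling here means reordering whole rows, so that every review stays attached to its own label. The following is a minimal sketch of that idea using only the standard library; the toy `rows` list and the helper name `shuffle_rows` are assumptions made for illustration and are not part of the original notebook.

```python
import random

# Hypothetical toy data: (review, sentiment) pairs, initially ordered by class.
rows = [
    ("great film", 1),
    ("loved it", 1),
    ("terrible plot", 0),
    ("boring", 0),
]

def shuffle_rows(rows, seed=0):
    """Return a shuffled copy of rows; each review stays paired with its label."""
    shuffled = list(rows)                  # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)  # seeded so the order is reproducible
    return shuffled

shuffled = shuffle_rows(rows)
```

Using a fixed seed keeps the order identical between runs, which makes later train/test splits comparable.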

<br>
<br>

## Data Preprocessing

In [1]:
# Firstly, please note that the performance of google word2vec is better on big datasets. 
# In this example we are considering only 25000 training examples from the imdb dataset.
# Therefore, the performance is similar to the "bag of words" model.

# Importing libraries
import numpy as np
import pandas as pd
# BeautifulSoup is used to remove html tags from the text
from bs4 import BeautifulSoup 
import re # For regular expressions

# Stopwords can be useful to understand the semantics of the sentence.
# Therefore stopwords are not removed while creating the word2vec model.
# But they will be removed  while averaging feature vectors.
from nltk.corpus import stopwords

In [2]:
df = pd.read_csv('shuffled_movie_data.csv')

In [3]:
df.tail()

| Unnamed: 0 | review | sentiment |
|---|---|---|
| 49995 | OK, lets start with the best. the building. al... | 0 |
| 49996 | The British 'heritage film' industry is out of... | 0 |
| 49997 | I don't even know where to begin on this one. ... | 0 |
| 49998 | Richard Tyler is a little boy who is scared of... | 0 |
| 49999 | I waited long to watch this movie. Also becaus... | 1 |


In [4]:
# Convert the text into a list of words
def review_wordlist(review, remove_stopwords=False):
    # 1. Remove HTML tags (naming the parser avoids the "no parser specified" warning)
    review_text = BeautifulSoup(review, "html.parser").get_text()
    # 2. Keep letters only
    review_text = re.sub("[^a-zA-Z]", " ", review_text)
    # 3. Convert everything to lower case and split into words
    words = review_text.lower().split()
    # 4. Optionally remove stop words
    if remove_stopwords:
        stops = set(stopwords.words("english"))
        words = [w for w in words if w not in stops]

    return words
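The four cleaning steps can be tried on a single string without installing BeautifulSoup or NLTK. The sketch below is only an approximation of `review_wordlist`: a regular expression stands in for the HTML parser, and `TOY_STOPWORDS` is a tiny made-up stop-word set rather than NLTK's English list.

```python
import re

# Tiny illustrative stop-word set (an assumption; the notebook uses NLTK's list).
TOY_STOPWORDS = {"this", "is", "a", "the"}

def clean_review(text, stopwords=None):
    """Approximate the notebook's cleaning steps with the standard library only."""
    text = re.sub(r"<[^>]+>", " ", text)    # 1. drop HTML tags (regex stand-in)
    text = re.sub(r"[^a-zA-Z]", " ", text)  # 2. keep letters only
    words = text.lower().split()            # 3. lower-case and split on whitespace
    if stopwords:                           # 4. optionally drop stop words
        words = [w for w in words if w not in stopwords]
    return words

example = clean_review("This movie is GREAT!<br />10/10", TOY_STOPWORDS)
print(example)  # ['movie', 'great']
```

Note that step 2 also removes digits and apostrophes, so a score such as "10/10" disappears and "don't" becomes the two tokens "don" and "t".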

In [5]:
import nltk.data

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

In [6]:
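# Tokenizer function.
# Note: this definition reuses the name `tokenizer`, so it replaces the punkt
# sentence tokenizer loaded in the previous cell. It splits on whitespace,
# so it returns words rather than sentences.
def tokenizer(text):
    text = re.sub('<[^>]*>', '', text)   # strip HTML tags
    text = re.sub('!', ' ! ', text)      # treat '!' as a separate token
    # Emoticons are collected here but are not used in the returned value
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    #text = re.sub('[\W]+' + exclamation + ']', ' ', text) + ' '.join(emoticons).replace('-', '')
    text = text.split()
    #text = [w for w in text.split() if w not in stop]
    #tokenized = [porter.stem(w) for w in text]
    return text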
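To see what the emoticon pattern in this cell matches, it can be run on its own. The sketch below copies the regular expression from the cell; the wrapper `find_emoticons` and the sample strings are assumptions added for illustration.

```python
import re

# Same pattern as in the tokenizer cell: eyes (: ; =), optional nose (-), mouth () ( D P).
EMOTICON_RE = re.compile(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)')

def find_emoticons(text):
    """Return all emoticons found in text, in order of appearance."""
    return EMOTICON_RE.findall(text)

found = find_emoticons("Great movie :) but a sad ending :-(")
print(found)  # [':)', ':-(']

# The pattern expects upper-case 'D' and 'P', so lower-casing the text first
# (as the cell does) prevents ':D' and ':P' from matching.
upper_match = find_emoticons(":D")
lower_match = find_emoticons(":D".lower())
```

If emoticons such as `:D` should be kept, one option is to search the original text before lower-casing it.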

In [7]:
# This function splits a review into sentences (using the tokenizer passed in)
# and converts each non-empty one into a list of words
def review_sentences(review, tokenizer, remove_stopwords=False):
    raw_sentences = tokenizer(review)
    sentences = []
    for raw_sentence in raw_sentences:
        if len(raw_sentence) > 0:
            sentences.append(review_wordlist(raw_sentence, remove_stopwords))

    return sentences
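Word2Vec expects its training input as a list of sentences, each of which is a list of words. The sketch below shows that shape on a toy review. It uses a naive regular-expression splitter instead of NLTK's punkt tokenizer, and the helper names `split_sentences`, `to_words` and `review_to_sentences` are hypothetical.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def to_words(sentence):
    """Keep letters only, lower-case, and split on whitespace."""
    return re.sub(r"[^a-zA-Z]", " ", sentence).lower().split()

def review_to_sentences(review):
    """Turn one review into a list of sentences, each a list of words."""
    sentences = [to_words(s) for s in split_sentences(review)]
    return [s for s in sentences if s]  # drop sentences with no words left

corpus = review_to_sentences("I loved it. The plot was thin!")
print(corpus)  # [['i', 'loved', 'it'], ['the', 'plot', 'was', 'thin']]
```

A real sentence tokenizer handles abbreviations such as "O.J." far better than this splitter, which would cut the sentence after each period followed by a space.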

In [8]:
df["review"][0]

'In 1974, the teenager Martha Moxley (Maggie Grace) moves to the high-class area of Belle Haven, Greenwich, Connecticut. On the Mischief Night, eve of Halloween, she was murdered in the backyard of her house and her murder remained unsolved. Twenty-two years later, the writer Mark Fuhrman (Christopher Meloni), who is a former LA detective that has fallen in disgrace for perjury in O.J. Simpson trial and moved to Idaho, decides to investigate the case with his partner Stephen Weeks (Andrew Mitchell) with the purpose of writing a book. The locals squirm and do not welcome them, but with the support of the retired detective Steve Carroll (Robert Forster) that was in charge of the investigation in the 70\'s, they discover the criminal and a net of power and money to cover the murder.<br /><br />"Murder in Greenwich" is a good TV movie, with the true story of a murder of a fifteen years old girl that was committed by a wealthy teenager whose mother was a Kennedy. The powerful and rich famil

In [9]:
sentences = []
print("Parsing sentences from training set")
for review in df["review"]:
    sentences += review_sentences(review, tokenizer)

Parsing sentences from training set



In [25]:
save = pd.Series(sentences)
# Save the parsed sentences, which take a long time to compute
save.to_csv("datos que toman mucho tiempo en cargar")

## Extracting Text Features with word2vec

In [26]:
# Importing the built-in logging module
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [27]:
# Creating the model and setting values for the various parameters
num_features = 10   # word vector dimensionality
min_word_count = 40 # minimum word count
num_workers = 4     # number of worker threads
context = 10        # context window size
downsampling = 1e-3 # (0.001) Downsample setting for frequent words

# Initializing the train model
# (`size` is the gensim 3.x argument name; gensim 4 renamed it to `vector_size`)
from gensim.models import word2vec
print("Training model....")
model = word2vec.Word2Vec(sentences, workers=num_workers, size=num_features,
                          min_count=min_word_count, window=context, sample=downsampling)

# Keep only the normalized vectors to save memory; the model cannot be trained further afterwards
model.init_sims(replace=True)

model_name = "miModelo"
model.save(model_name)

2018-11-22 11:32:21,527 : INFO : 'pattern' package not found; tag filters are not available for English
2018-11-22 11:32:21,536 : INFO : collecting all words and their counts
2018-11-22 11:32:21,537 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2018-11-22 11:32:21,545 : INFO : PROGRESS: at sentence #10000, processed 10218 words, keeping 2582 word types
2018-11-22 11:32:21,555 : INFO : PROGRESS: at sentence #20000, processed 20503 words, keeping 4279 word types
2018-11-22 11:32:21,567 : INFO : PROGRESS: at sentence #30000, processed 30704 words, keeping 5483 word types
2018-11-22 11:32:21,579 : INFO : PROGRESS: at sentence #40000, processed 40952 words, keeping 6665 word types
2018-11-22 11:32:21,588 : INFO : PROGRESS: at sentence #50000, processed 51224 words, keeping 7506 word types
2018-11-22 11:32:21,601 : INFO : PROGRESS: at sentence #60000, processed 61525 words, keeping 8453 word types

Training model....



2018-11-22 11:32:50,991 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-11-22 11:32:50,992 : INFO : EPOCH - 1 : training on 11706951 raw words (8257234 effective words) took 19.1s, 433067 effective words/s
2018-11-22 11:32:52,011 : INFO : EPOCH 2 - PROGRESS: at 5.12% examples, 421115 words/s, in_qsize 0, out_qsize 1
2018-11-22 11:32:52,998 : INFO : EPOCH 2 - PROGRESS: at 10.50% examples, 433073 words/s, in_qsize 0, out_qsize 0
2018-11-22 11:32:53,999 : INFO : EPOCH 2 - PROGRESS: at 15.80% examples, 434093 words/s, in_qsize 0, out_qsize 0
2018-11-22 11:32:55,008 : INFO : EPOCH 2 - PROGRESS: at 21.18% examples, 435712 words/s, in_qsize 0, out_qsize 0
2018-11-22 11:32:56,031 : INFO : EPOCH 2 - PROGRESS: at 26.48% examples, 434400 words/s, in_qsize 1, out_qsize 1
2018-11-22 11:32:57,035 : INFO : EPOCH 2 - PROGRESS: at 31.86% examples, 435770 words/s, in_qsize 0, out_qsize 1

2018-11-22 11:33:52,569 : INFO : EPOCH 5 - PROGRESS: at 26.65% examples, 434871 words/s, in_qsize 0, out_qsize 0
2018-11-22 11:33:53,598 : INFO : EPOCH 5 - PROGRESS: at 32.12% examples, 435642 words/s, in_qsize 0, out_qsize 0
2018-11-22 11:33:54,620 : INFO : EPOCH 5 - PROGRESS: at 37.41% examples, 434276 words/s, in_qsize 0, out_qsize 0
2018-11-22 11:33:55,633 : INFO : EPOCH 5 - PROGRESS: at 42.79% examples, 435361 words/s, in_qsize 0, out_qsize 0
2018-11-22 11:33:56,644 : INFO : EPOCH 5 - PROGRESS: at 48.09% examples, 434786 words/s, in_qsize 0, out_qsize 1
2018-11-22 11:33:57,660 : INFO : EPOCH 5 - PROGRESS: at 53.56% examples, 436008 words/s, in_qsize 0, out_qsize 0
2018-11-22 11:33:58,656 : INFO : EPOCH 5 - PROGRESS: at 58.94% examples, 436747 words/s, in_qsize 0, out_qsize 0
2018-11-22 11:33:59,664 : INFO : EPOCH 5 - PROGRESS: at 64.32% examples, 437059 words/s, in_qsize 0, out_qsize 0
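To make the `min_count` and `window` settings used above more concrete, the sketch below builds a vocabulary with a minimum count and lists the (center, context) word pairs that fall inside a window. It is a simplified illustration in plain Python, not gensim's implementation, which differs in details such as subsampling of frequent words. The function names and the toy corpus are assumptions.

```python
from collections import Counter

def build_vocab(sentences, min_count):
    """Keep only words that appear at least min_count times in the corpus."""
    counts = Counter(w for s in sentences for w in s)
    return {w for w, c in counts.items() if c >= min_count}

def context_pairs(sentences, vocab, window):
    """List (center, context) pairs for words at most `window` positions apart."""
    pairs = []
    for s in sentences:
        kept = [w for w in s if w in vocab]  # out-of-vocabulary words are dropped
        for i, center in enumerate(kept):
            lo = max(0, i - window)
            hi = min(len(kept), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((center, kept[j]))
    return pairs

toy = [["good", "movie", "good", "plot"], ["bad", "movie"]]
vocab = build_vocab(toy, min_count=2)
pairs = context_pairs(toy, vocab, window=1)
print(sorted(vocab))  # ['good', 'movie']
```

With very short sentences, few or no pairs are produced, which is one reason the sentence splitting step matters for the quality of the vectors.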

In [28]:
# This function averages all the word vectors within one paragraph
def featureVecMethod(words, model, num_features):
    featureVec = np.zeros(num_features, dtype="float32")
    nwords = 0
    index2word_set = set(model.wv.index2word)  # vocabulary known to the model

    for word in words:
        if word in index2word_set:
            nwords = nwords + 1
            featureVec = np.add(featureVec, model.wv[word])

    # If no word is in the vocabulary, nwords is 0 and the result is NaN
    featureVec = np.divide(featureVec, nwords)
    return featureVec
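The averaging step can be illustrated without a trained model. In the sketch below a plain dictionary stands in for the word2vec vocabulary and Python lists stand in for NumPy arrays; `average_vector` and `toy_vectors` are hypothetical names. Unlike the function above, this version returns a zero vector when no word is known, which is one possible design choice rather than the notebook's behavior.

```python
def average_vector(words, vectors, num_features):
    """Average the vectors of the known words; return zeros if none are known."""
    total = [0.0] * num_features
    n_known = 0
    for word in words:
        vec = vectors.get(word)
        if vec is None:
            continue  # skip out-of-vocabulary words
        total = [t + v for t, v in zip(total, vec)]
        n_known += 1
    if n_known == 0:
        return total  # all zeros; avoids dividing by zero
    return [t / n_known for t in total]

# Hypothetical 2-dimensional word vectors.
toy_vectors = {"good": [1.0, 0.0], "movie": [0.0, 1.0]}

avg = average_vector(["good", "movie", "unknown"], toy_vectors, 2)
empty = average_vector(["unknown"], toy_vectors, 2)
print(avg)    # [0.5, 0.5]
print(empty)  # [0.0, 0.0]
```

Averaging discards word order, so "not good" and "good not" map to the same vector.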

In [29]:
# Computes the average feature vector for every review
def getAvgFeatureVecs(reviews, model, num_features):
    counter = 0
    reviewFeatureVecs = np.zeros((len(reviews),num_features),dtype="float32")
    for review in reviews:
        if counter%1000 == 0:
            print("Review %d of %d"%(counter,len(reviews)))
            
        reviewFeatureVecs[counter] = featureVecMethod(review, model, num_features)
        counter = counter+1
        
    return reviewFeatureVecs

In [31]:
# Compute the average feature vectors for the training set

clean_reviews = []
for review in df['review']:
    clean_reviews.append(review_wordlist(review, remove_stopwords=True))
    
dataVecs = getAvgFeatureVecs(clean_reviews, model, num_features)

Review 0 of 50000
Review 1000 of 50000
Review 2000 of 50000
Review 3000 of 50000
Review 4000 of 50000
Review 5000 of 50000
Review 6000 of 50000
Review 7000 of 50000
Review 8000 of 50000
Review 9000 of 50000
Review 10000 of 50000
Review 11000 of 50000
Review 12000 of 50000
Review 13000 of 50000
Review 14000 of 50000
Review 15000 of 50000
Review 16000 of 50000
Review 17000 of 50000
Review 18000 of 50000
Review 19000 of 50000
Review 20000 of 50000
Review 21000 of 50000
Review 22000 of 50000
Review 23000 of 50000
Review 24000 of 50000
Review 25000 of 50000
Review 26000 of 50000
Review 27000 of 50000
Review 28000 of 50000
Review 29000 of 50000
Review 30000 of 50000
Review 31000 of 50000
Review 32000 of 50000
Review 33000 of 50000
Review 34000 of 50000
Review 35000 of 50000
Review 36000 of 50000
Review 37000 of 50000
Review 38000 of 50000
Review 39000 of 50000
Review 40000 of 50000
Review 41000 of 50000
Review 42000 of 50000
Review 43000 of 50000
Review 44000 of 50000
Review 45000 of 50000
Review 46000 of 50000
Review 47000 of 50000
Review 48000 of 50000
Review 49000 of 

In [32]:
dataVecs.shape

(50000, 10)

In [62]:
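# Reviews with no in-vocabulary words produce NaN vectors;
# replace each one with the vector of the previous review
for i in range(len(dataVecs)):
    if np.any(np.isnan(dataVecs[i])):
        dataVecs[i] = dataVecs[i-1]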

In [64]:
save2 = pd.DataFrame(data= dataVecs)
save2.to_csv("vectores de 10 dimensiones")

In [68]:
Y = df["sentiment"]

Let's run a quick test:

# Training

We define our training sets X_tr and Y_tr.

In [108]:
X_tr = dataVecs[:50]
Y_tr = Y[:50]
X_tr.shape

(50, 10)

In [109]:
Y_tr = Y_tr.values[:, np.newaxis]  # convert to a NumPy column vector of shape (50, 1)

In [110]:
Y_tr.shape

(50, 1)

Using the TensorFlow library

In [111]:
import tensorflow as tf

lstm_size = 256
lstm_layers = 2
batch_size = 100
learning_rate = 0.01

In [112]:
#n_words = len(vocab_to_int) + 1 # Add 1 for 0 added to vocab

# Create the graph object
tf.reset_default_graph()
with tf.name_scope('inputs'):
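    # NOTE: int32 token ids are expected here (they index the embedding matrix);
    # dataVecs holds float feature vectors, which does not match this dtype.
    inputs_ = tf.placeholder(tf.int32, [None, None], name="inputs")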
    labels_ = tf.placeholder(tf.int32, [None, None], name="labels")
    keep_prob = tf.placeholder(tf.float32, name="keep_prob")



In [113]:
embed_size = 10 
n_words = 50000
with tf.name_scope("Embeddings"):
    embedding = tf.Variable(tf.random_uniform((n_words, embed_size), -1, 1))
    embed = tf.nn.embedding_lookup(embedding, inputs_)
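
For a single embedding matrix, the lookup above amounts to selecting rows by integer id. A minimal NumPy sketch of that idea follows; the `token_ids` batch is hypothetical, and the matrix is random just as in the cell above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, embed_size = 50000, 10   # same sizes as in the cell above
embedding = rng.uniform(-1, 1, size=(n_words, embed_size))

# Hypothetical batch: 2 sequences of 4 token ids each.
token_ids = np.array([[3, 17, 17, 0],
                      [42, 7, 1, 49999]])

# Integer indexing returns one embedding row per id.
embed = embedding[token_ids]

print(embed.shape)
```

Each id maps to one row, so repeated ids (17 above) produce identical vectors, and the result has shape (batch, sequence length, embed_size).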

In [114]:
def lstm_cell():
    # Your basic LSTM cell
    lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size, reuse=tf.get_variable_scope().reuse)
    # Add dropout to the cell
    return tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=keep_prob)

with tf.name_scope("RNN_layers"):
    # Stack up multiple LSTM layers, for deep learning
    cell = tf.contrib.rnn.MultiRNNCell([lstm_cell() for _ in range(lstm_layers)])
    
    # Getting an initial state of all zeros
    initial_state = cell.zero_state(batch_size, tf.float32)

In [115]:
with tf.name_scope("RNN_forward"):
    outputs, final_state = tf.nn.dynamic_rnn(cell, embed, initial_state=initial_state)

In [116]:
with tf.name_scope('predictions'):
    predictions = tf.contrib.layers.fully_connected(outputs[:, -1], 1, activation_fn=tf.sigmoid)
    tf.summary.histogram('predictions', predictions)
with tf.name_scope('cost'):
    cost = tf.losses.mean_squared_error(labels_, predictions)
    tf.summary.scalar('cost', cost)

with tf.name_scope('train'):
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)

merged = tf.summary.merge_all()
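
The prediction and cost cells keep only the last time step of the RNN output, pass it through a one-unit sigmoid layer, and compare the result with the labels using mean squared error. A NumPy sketch of those three steps, under the assumption of made-up sizes and random, untrained weights (`W`, `b`, `labels` are all hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
batch, steps, lstm_size = 4, 6, 256   # batch and steps are made-up sizes

# Hypothetical RNN outputs: one lstm_size vector per time step.
outputs = rng.normal(size=(batch, steps, lstm_size))
last_step = outputs[:, -1]            # shape (batch, lstm_size)

# Hypothetical dense-layer parameters (random, untrained).
W = rng.normal(scale=0.01, size=(lstm_size, 1))
b = np.zeros(1)
predictions = 1.0 / (1.0 + np.exp(-(last_step @ W + b)))   # shape (batch, 1)

# Hypothetical binary labels, one per example.
labels = np.array([[1], [0], [1], [0]])
cost = np.mean((labels - predictions) ** 2)                 # mean squared error

print(last_step.shape, predictions.shape)
```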

In [123]:
with tf.name_scope('validation'):
    correct_pred = tf.equal(tf.cast(tf.round(predictions), tf.int32), labels_)
    accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
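
The accuracy node rounds each sigmoid output to 0 or 1 and averages the matches against the labels. The same computation in NumPy, on hypothetical predictions and labels:

```python
import numpy as np

# Hypothetical sigmoid outputs and true labels, both of shape (4, 1).
predictions = np.array([[0.91], [0.12], [0.55], [0.40]])
labels = np.array([[1], [0], [0], [1]])

# Round to the nearest class, compare element-wise, then average.
correct_pred = np.round(predictions).astype(np.int32) == labels
accuracy = correct_pred.astype(np.float32).mean()

print(accuracy)
```

The first two predictions match their labels and the last two do not, so half of the examples count as correct.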

In [117]:
def get_batches(x, y, batch_size=100):
    
    n_batches = len(x)//batch_size
    x, y = x[:n_batches*batch_size], y[:n_batches*batch_size]
    for ii in range(0, len(x), batch_size):
        yield x[ii:ii+batch_size], y[ii:ii+batch_size]
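
`get_batches` drops any trailing rows that do not fill a complete batch. The sketch below repeats the function so that it runs on its own and counts the batches produced for a few hypothetical array sizes.

```python
import numpy as np

def get_batches(x, y, batch_size=100):
    # Keep only as many rows as fit into complete batches.
    n_batches = len(x) // batch_size
    x, y = x[:n_batches * batch_size], y[:n_batches * batch_size]
    for ii in range(0, len(x), batch_size):
        yield x[ii:ii + batch_size], y[ii:ii + batch_size]

def count_batches(n_rows, batch_size=100):
    # Hypothetical helper: number of batches produced for n_rows examples.
    x = np.zeros((n_rows, 10))
    y = np.zeros((n_rows, 1))
    return sum(1 for _ in get_batches(x, y, batch_size))

n_small = count_batches(50)     # fewer rows than one batch
n_odd = count_batches(250)      # the last 50 rows are dropped
n_big = count_batches(1000)

print(n_small, n_odd, n_big)
```

With the 50-row training set defined earlier and `batch_size = 100`, the generator yields no batches at all, so a loop over it would not execute its body.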

Training

In [125]:
epochs = 1

#with graph.as_default():
saver = tf.train.Saver()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    train_writer = tf.summary.FileWriter('./logs/tb/train', sess.graph)
    test_writer = tf.summary.FileWriter('./logs/tb/test', sess.graph)
    iteration = 1
    for e in range(epochs):
        state = sess.run(initial_state)
        
        for ii, (x, y) in enumerate(get_batches(X_tr, Y_tr, batch_size), 1):
            feed = {inputs_: x,
                    labels_: y,  # Y_tr already has shape (n, 1); no extra axis needed
                    keep_prob: 0.5,
                    initial_state: state}
            summary, loss, state, _ = sess.run([merged, cost, final_state, optimizer], feed_dict=feed)
            #loss, state, _ = sess.run([cost, final_state, optimizer], feed_dict=feed)

            train_writer.add_summary(summary, iteration)
        
            if iteration%5==0:
                print("Epoch: {}/{}".format(e, epochs),
                      "Iteration: {}".format(iteration),
                      "Train loss: {:.3f}".format(loss))

            if iteration%25==0:
                val_acc = []
                val_state = sess.run(cell.zero_state(batch_size, tf.float32))
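                # NOTE: val_x and val_y are not defined in this section;
                # they must exist before this branch runs.
                for x, y in get_batches(val_x, val_y, batch_size):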
                    feed = {inputs_: x,
                            labels_: y[:, None],
                            keep_prob: 1,
                            initial_state: val_state}
#                     batch_acc, val_state = sess.run([accuracy, final_state], feed_dict=feed)
                    summary, batch_acc, val_state = sess.run([merged, accuracy, final_state], feed_dict=feed)
                    val_acc.append(batch_acc)
                print("Val acc: {:.3f}".format(np.mean(val_acc)))
            iteration +=1
            test_writer.add_summary(summary, iteration)
            saver.save(sess, "checkpoints/sentiment_manish.ckpt")
    saver.save(sess, "checkpoints/sentiment_manish.ckpt")

# Testing

Test vectors

In [120]:
X_test = dataVecs[:1000]
Y_test = Y[:1000]
# Convert the pandas Series to a NumPy array before adding the column axis.
Y_test = Y_test.values[:, np.newaxis]

In [126]:
test_acc = []
with tf.Session() as sess:
    saver.restore(sess, "checkpoints/sentiment_manish.ckpt")
    test_state = sess.run(cell.zero_state(batch_size, tf.float32))
    for ii, (x, y) in enumerate(get_batches(X_test, Y_test, batch_size), 1):
        feed = {inputs_: x,
                labels_: y[:, None],
                keep_prob: 1,
                initial_state: test_state}
        batch_acc, test_state = sess.run([accuracy, final_state], feed_dict=feed)
        test_acc.append(batch_acc)
    print("Test accuracy: {:.3f}".format(np.mean(test_acc)))

2018-11-22 22:46:10,611 : INFO : Restoring parameters from checkpoints/sentiment_manish.ckpt


ValueError: Cannot feed value of shape (100, 1, 1) for Tensor u'inputs/labels:0', which has shape '(?, ?)'
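
The error above reports a rank-3 value of shape (100, 1, 1) being fed into a rank-2 placeholder. A NumPy sketch of where the extra axis comes from, assuming `Y_test` already has shape (1000, 1) after the `np.newaxis` step (the zero-filled array here is a hypothetical stand-in for the real labels):

```python
import numpy as np

# Hypothetical stand-in for Y_test after the np.newaxis step: shape (1000, 1).
Y_test = np.zeros(1000, dtype=np.int64)[:, np.newaxis]

y = Y_test[:100]            # one batch of labels, shape (100, 1)
with_extra_axis = y[:, None]  # what the feed dict builds: shape (100, 1, 1)

print(y.shape, with_extra_axis.shape)
```

Feeding `y` directly would match the rank of the `labels_` placeholder; whether the rest of the graph then runs as intended is not verified here.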

<br>
<br>