# Applying Logistic Regression to Sentiment Analysis

Adapted from http://nbviewer.jupyter.org/github/rasbt/pattern_classification/blob/master/machine_learning/scikit-learn/outofcore_modelpersistence.ipynb

<br>
<br>

## The IMDb Movie Review Dataset

In this section we train a logistic regression model to classify reviews from the 50K IMDb dataset collected by Maas et al.

> AL Maas, RE Daly, PT Pham, D Huang, AY Ng, and C Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics.

[Source: http://ai.stanford.edu/~amaas/data/sentiment/]

The dataset consists of 50,000 movie reviews taken from the original "train" and "test" splits. The labels are binary, with 25,000 positive reviews and 25,000 negative reviews.

Let us shuffle the class labels.
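
A minimal sketch of such a shuffle, assuming the reviews and labels are held in two parallel NumPy arrays (the toy `reviews` and `labels` below are made up for illustration). Applying one random permutation to both arrays keeps every review paired with its label:

```python
import numpy as np

# Toy stand-ins for the review texts and their sentiment labels (hypothetical data)
reviews = np.array(["great film", "terrible plot", "loved it", "boring"])
labels = np.array([1, 0, 1, 0])

# Draw one permutation of the row indices and apply it to both arrays
rng = np.random.RandomState(0)   # fixed seed so the shuffle is reproducible
order = rng.permutation(len(reviews))
shuffled_reviews = reviews[order]
shuffled_labels = labels[order]

# Every review still carries its original label
pairs_before = set(zip(reviews, labels))
pairs_after = set(zip(shuffled_reviews, shuffled_labels))
print(pairs_before == pairs_after)
```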

<br>
<br>

## Data Preprocessing

In [1]:
# Firstly, please note that the performance of google word2vec is better on big datasets. 
# In this example we are considering only 25000 training examples from the imdb dataset.
# Therefore, the performance is similar to the "bag of words" model.

# Importing libraries
import numpy as np
import pandas as pd
# BeautifulSoup is used to remove html tags from the text
from bs4 import BeautifulSoup 
import re # For regular expressions

# Stopwords can be useful to understand the semantics of the sentence.
# Therefore stopwords are not removed while creating the word2vec model.
# But they will be removed  while averaging feature vectors.
from nltk.corpus import stopwords

In [2]:
df = pd.read_csv('shuffled_movie_data.csv')

In [3]:
df.tail()

Unnamed: 0,review,sentiment
49995,"OK, lets start with the best. the building. al...",0
49996,The British 'heritage film' industry is out of...,0
49997,I don't even know where to begin on this one. ...,0
49998,Richard Tyler is a little boy who is scared of...,0
49999,I waited long to watch this movie. Also becaus...,1


In [4]:
# Convert the text into a sequence of words
def review_wordlist(review, remove_stopwords=False):
    # 1. Remove HTML tags (naming the parser explicitly avoids
    #    Beautiful Soup's parser-guessing warning)
    review_text = BeautifulSoup(review, "html.parser").get_text()
    # 2. Keep letters only
    review_text = re.sub("[^a-zA-Z]", " ", review_text)
    # 3. Convert everything to lower case and split into words
    words = review_text.lower().split()
    # 4. Optionally remove stop words
    if remove_stopwords:
        stops = set(stopwords.words("english"))
        words = [w for w in words if w not in stops]

    return words
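
To make the effect of these four steps concrete, here is a small self-contained sketch. It is a simplified stand-in rather than the function above: it strips tags with a regular expression instead of Beautiful Soup and uses a tiny hand-written stop-word set instead of the NLTK list. Both substitutions, and the name `simple_wordlist`, are assumptions made so the example runs without extra downloads.

```python
import re

# Tiny illustrative stop-word set (an assumption; NLTK's English list is much longer)
TOY_STOPWORDS = {"the", "a", "is", "was", "it"}

def simple_wordlist(text, remove_stopwords=False):
    text = re.sub(r"<[^>]*>", " ", text)       # 1. drop HTML tags
    text = re.sub(r"[^a-zA-Z]", " ", text)     # 2. keep letters only
    words = text.lower().split()               # 3. lower-case and split
    if remove_stopwords:                       # 4. optionally drop stop words
        words = [w for w in words if w not in TOY_STOPWORDS]
    return words

sample = "The movie was GREAT!<br /><br />10/10, it is a must-see."
print(simple_wordlist(sample))
print(simple_wordlist(sample, remove_stopwords=True))
```

Digits and punctuation disappear in step 2, which is why "10/10" leaves no trace and "must-see" becomes two words.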

In [5]:
import nltk.data

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

In [6]:
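# Tokenizer function
# Note: this definition reuses the name `tokenizer`, so it shadows the punkt
# tokenizer loaded in the previous cell. It returns whitespace-separated
# tokens rather than sentences.
def tokenizer(text):
    text = re.sub('<[^>]*>', '', text)   # strip HTML tags
    text = re.sub('!', ' ! ', text)      # make every "!" a separate token
    # Emoticons are collected here but not used in the returned value
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    #text = re.sub('[\W]+' + exclamation + ']', ' ', text) + ' '.join(emoticons).replace('-', '')
    text = text.split()
    #text = [w for w in text.split() if w not in stop]
    #tokenized = [porter.stem(w) for w in text]
    return text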

In [7]:
# This function splits a review into sentences and turns each one into a word list
def review_sentences(review, tokenizer, remove_stopwords=False):
    raw_sentences = tokenizer(review)
    sentences = []
    for raw_sentence in raw_sentences:
        if len(raw_sentence) > 0:
            sentences.append(review_wordlist(raw_sentence, remove_stopwords))

    return sentences
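
word2vec learns from the words that occur together inside one sentence, so its training input is usually a list of sentences, each of which is a list of words. The sketch below illustrates that shape with a naive regex-based sentence splitter. The splitter is an assumption standing in for a trained sentence tokenizer such as punkt, and the helper names are hypothetical.

```python
import re

def naive_sentence_split(text):
    # Split after ., ! or ? followed by whitespace (a rough heuristic only)
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def to_word_lists(text):
    sentences = []
    for raw in naive_sentence_split(text):
        words = re.sub(r"[^a-zA-Z]", " ", raw).lower().split()
        if words:                      # skip pieces that contain no letters
            sentences.append(words)
    return sentences

print(to_word_lists("I loved it. The plot was thin! Would watch again?"))
```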

In [8]:
df["review"][0]

'In 1974, the teenager Martha Moxley (Maggie Grace) moves to the high-class area of Belle Haven, Greenwich, Connecticut. On the Mischief Night, eve of Halloween, she was murdered in the backyard of her house and her murder remained unsolved. Twenty-two years later, the writer Mark Fuhrman (Christopher Meloni), who is a former LA detective that has fallen in disgrace for perjury in O.J. Simpson trial and moved to Idaho, decides to investigate the case with his partner Stephen Weeks (Andrew Mitchell) with the purpose of writing a book. The locals squirm and do not welcome them, but with the support of the retired detective Steve Carroll (Robert Forster) that was in charge of the investigation in the 70\'s, they discover the criminal and a net of power and money to cover the murder.<br /><br />"Murder in Greenwich" is a good TV movie, with the true story of a murder of a fifteen years old girl that was committed by a wealthy teenager whose mother was a Kennedy. The powerful and rich famil

In [9]:
sentences = []
print("Parsing sentences from training set")
for review in df["review"]:
    sentences += review_sentences(review, tokenizer)

Parsing sentences from training set



## Extracting Text Features with word2vec

In [10]:
# Importing the built-in logging module
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [289]:
# Creating the model and setting values for the various parameters
num_features = 30   # word vector dimensionality
min_word_count = 40 # minimum word count
num_workers = 4     # number of worker threads
context = 10        # context window size
downsampling = 1e-3 # (0.001) Downsample setting for frequent words

# Initializing the train model
# (`size=` and `init_sims` belong to the gensim 3.x API used in this notebook)
from gensim.models import word2vec
print("Training model....")
model = word2vec.Word2Vec(sentences, workers=num_workers, size=num_features,
                          min_count=min_word_count, window=context, sample=downsampling)

# L2-normalize the vectors in place; the model cannot be trained further afterwards
model.init_sims(replace=True)

model_name = "miModelo"
model.save(model_name)
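
`init_sims(replace=True)` replaces each word vector with its unit-length version. A minimal NumPy sketch of what that normalization gives, using a made-up three-word embedding matrix: once every row has length 1, a plain dot product equals the cosine similarity.

```python
import numpy as np

# Made-up embedding matrix: one row per word (values are illustrative only)
vectors = np.array([[3.0, 4.0],
                    [6.0, 8.0],
                    [4.0, -3.0]], dtype="float32")

# Divide every row by its Euclidean norm
norms = np.linalg.norm(vectors, axis=1, keepdims=True)
unit_vectors = vectors / norms

# With unit-length rows, the matrix of dot products is the cosine-similarity matrix
cosine = unit_vectors @ unit_vectors.T
print(np.round(cosine, 3))
```

Rows 0 and 1 point in the same direction, so their similarity is 1; rows 0 and 2 are orthogonal, so their similarity is 0.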

2018-11-08 21:37:52,815 : INFO : collecting all words and their counts
2018-11-08 21:37:52,817 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types

Training model....


2018-11-08 21:38:01,966 : INFO : PROGRESS: at sentence #11110000, processed 11396659 words, keeping 100239 word types
2018-11-08 21:39:34,392 : INFO : EPOCH 5 - PROGRESS: at 80.04% examples, 433364 words/s, in_qsize 6, out_qsize 0

In [290]:
# This function averages the vectors of all in-vocabulary words in a paragraph
def featureVecMethod(words, model, num_features):
    featureVec = np.zeros(num_features, dtype="float32")
    nwords = 0
    index2word_set = set(model.wv.index2word)

    for word in words:
        if word in index2word_set:
            nwords = nwords + 1
            featureVec = np.add(featureVec, model.wv[word])

    # Avoid dividing by zero when no word of the paragraph is in the vocabulary
    if nwords > 0:
        featureVec = np.divide(featureVec, nwords)
    return featureVec
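
The averaging step can be tried in isolation. The sketch below uses a plain dictionary of made-up 2-dimensional vectors in place of a trained model (the dictionary and the name `average_vector` are assumptions for illustration). Words outside the vocabulary are skipped, and a text with no known words maps to the zero vector.

```python
import numpy as np

# Toy 2-dimensional embeddings (hypothetical values, standing in for a trained model)
toy_vectors = {
    "good": np.array([1.0, 0.0], dtype="float32"),
    "film": np.array([0.0, 1.0], dtype="float32"),
}

def average_vector(words, vectors, num_features):
    total = np.zeros(num_features, dtype="float32")
    known = 0
    for word in words:
        if word in vectors:            # out-of-vocabulary words are skipped
            total += vectors[word]
            known += 1
    # Return the zero vector when nothing is in the vocabulary
    return total / known if known else total

print(average_vector(["good", "film", "zzz"], toy_vectors, 2))
print(average_vector(["zzz"], toy_vectors, 2))
```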

In [291]:
# Compute the average feature vector of every review
def getAvgFeatureVecs(reviews, model, num_features):
    counter = 0
    reviewFeatureVecs = np.zeros((len(reviews),num_features),dtype="float32")
    for review in reviews:
        if counter%1000 == 0:
            print("Review %d of %d"%(counter,len(reviews)))
            
        reviewFeatureVecs[counter] = featureVecMethod(review, model, num_features)
        counter = counter+1
        
    return reviewFeatureVecs

In [529]:
# Compute the average feature vectors for the training set

clean_reviews = []
for review in df['review']:
    clean_reviews.append(review_wordlist(review, remove_stopwords=True))

dataVecs = getAvgFeatureVecs(clean_reviews, model, num_features)

Let's try it out:

## Multilayer Neural Network

In [47]:
# The sigmoid function
def sigmoide(z):
    return 1/(1+np.exp(-z))
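
For large negative inputs, `np.exp(-z)` overflows and NumPy emits a runtime warning, although the result still rounds to 0. One common way around this, sketched here under the hypothetical name `sigmoide_estable`, is to choose for each input the algebraically equivalent form whose exponent is never positive:

```python
import numpy as np

def sigmoide_estable(z):
    z = np.atleast_1d(np.asarray(z, dtype=float))
    out = np.empty_like(z)
    pos = z >= 0
    # z >= 0: exp(-z) <= 1, so the usual formula is safe
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # z < 0: rewrite as exp(z) / (1 + exp(z)), where exp(z) < 1
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out

print(sigmoide_estable([-1000.0, 0.0, 1000.0]))
```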

In [525]:
# Multilayer neural network class (30 inputs, 30 hidden units, 1 output)
class RedNeuronal3:
    
    # Initialize the weight matrices and bias vectors
    def __init__(self):
        self.W1 = np.random.random((30,30))-0.5
        self.W2 = np.random.random((1,30))-0.5
        self.b1 = np.random.random((30,1))-0.5
        self.b2 = np.random.random((1,1))-0.5
    
    # Network output for a column vector x of shape (30, 1)
    def respuesta(self,x):
        z1 = np.matmul(self.W1,x)+self.b1
        a1 = sigmoide(z1)
        z2 = np.matmul(self.W2,a1)+self.b2
        a2 = sigmoide(z2)
        return a2
    
    # Layer 2 (output) error: derivative of the loss with respect to z2,
    # assuming a cross-entropy loss with a sigmoid output
    def d_L2(self, x, y):
        dL2 = self.respuesta(x)-y
        return dL2
    
    # Layer 1 (hidden) error: the output error propagated back through W2
    def d_L1(self, x, y):
        a1 = sigmoide(np.matmul(self.W1,x)+self.b1)
        dL2 = self.d_L2(x,y)
        dL1 = np.matmul(np.transpose(self.W2),dL2)*a1*(1-a1)
        return dL1
    
    
    # Training: batch gradient descent on the gradient averaged over all examples.
    # The defaults keep the original learning rate and iteration count; the
    # learning rate is very small and will likely need to be increased.
    def entrenamiento(self,x_tr,y_tr,alfa=0.000000001,iteraciones=20000):
        cantidad = x_tr.shape[1]

        for j in range(iteraciones):
            dW1 = np.zeros_like(self.W1)
            dB1 = np.zeros_like(self.b1)
            dW2 = np.zeros_like(self.W2)
            dB2 = np.zeros_like(self.b2)
            for i in range(cantidad):
                x = x_tr[:,i]
                x = x[:,np.newaxis]
                a1 = sigmoide(np.matmul(self.W1,x)+self.b1)
                dL2 = self.d_L2(x,y_tr[i])
                dL1 = self.d_L1(x,y_tr[i])
                # Accumulate the gradients of both layers from the same parameters
                dW2 += np.matmul(dL2,np.transpose(a1))
                dB2 += dL2
                dW1 += np.matmul(dL1,np.transpose(x))
                dB1 += dL1
            # Step against the averaged gradient
            self.W2 -= alfa*dW2/cantidad
            self.b2 -= alfa*dB2/cantidad
            self.W1 -= alfa*dW1/cantidad
            self.b1 -= alfa*dB1/cantidad
    
    # Testing: percentage of examples whose rounded output matches the label
    def testeo(self,x_test,y_test):
        cantidad=y_test.shape[0]
        accuracy=0
        for i in range(cantidad):
            x = x_test[:,i]
            x = x[:,np.newaxis]
            y_est = np.round(self.respuesta(x))
            if y_est==y_test[i]:
                accuracy += 1
        
        accuracy = (100.0*accuracy)/cantidad
        return accuracy
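
Hand-written backpropagation is easy to get subtly wrong, and a finite-difference check is a cheap way to catch that. The sketch below is not the class above: it is a small standalone two-layer sigmoid network with made-up sizes, and it assumes a cross-entropy loss, for which the output error is `a2 - y`. All function names are hypothetical. It compares the analytic gradients with central differences.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grads(params, x, y):
    """Cross-entropy loss and backprop gradients for one example.
    x has shape (n_in, 1); y is a scalar label, 0 or 1."""
    W1, b1, W2, b2 = params
    a1 = sigmoid(W1 @ x + b1)
    a2 = sigmoid(W2 @ a1 + b2)
    loss = -(y * np.log(a2) + (1 - y) * np.log(1 - a2)).item()
    d2 = a2 - y                          # output error, shape (1, 1)
    d1 = (W2.T @ d2) * a1 * (1 - a1)     # hidden error, shape (n_hidden, 1)
    # Gradients in the same order as params: W1, b1, W2, b2
    return loss, [d1 @ x.T, d1, d2 @ a1.T, d2]

def numeric_grad(params, x, y, k, eps=1e-6):
    """Central finite differences with respect to params[k]."""
    g = np.zeros_like(params[k])
    for idx in np.ndindex(params[k].shape):
        old = params[k][idx]
        params[k][idx] = old + eps
        loss_plus, _ = loss_and_grads(params, x, y)
        params[k][idx] = old - eps
        loss_minus, _ = loss_and_grads(params, x, y)
        params[k][idx] = old             # restore the parameter
        g[idx] = (loss_plus - loss_minus) / (2 * eps)
    return g

rng = np.random.RandomState(0)
n_in, n_hidden = 4, 3
params = [rng.rand(n_hidden, n_in) - 0.5, rng.rand(n_hidden, 1) - 0.5,
          rng.rand(1, n_hidden) - 0.5, rng.rand(1, 1) - 0.5]
x = rng.rand(n_in, 1)
y = 1.0

_, grads = loss_and_grads(params, x, y)
max_err = max(np.abs(grads[k] - numeric_grad(params, x, y, k)).max()
              for k in range(4))
print(max_err)
```

A maximum difference many orders of magnitude below the gradient values suggests the analytic formulas agree with the loss; a large difference usually points to a wrong sign, a missing activation derivative, or a transposed matrix.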
            

## Training

We define our training sets X_tr and Y_tr.

In [513]:
X =np.transpose(dataVecs)
X.shape

(30, 50000)

In [514]:
Z = np.hsplit(X,2)
X_tr = Z[0]
X_tr.shape

(30, 100)

In [515]:
y = df["sentiment"].values   # NumPy array, so the 2-D indexing below works
y = y[:,np.newaxis]
Y_tr=y[:25000]
Y_tr.shape

(100, 1)
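
Because samples are stored as columns of `X` but as rows of `y`, both slices have to select the same sample indices. A small sketch with made-up data and a hypothetical helper `split_columns`:

```python
import numpy as np

def split_columns(X, y, n_train):
    """X has one sample per column; y has one label per row."""
    X_train, X_rest = X[:, :n_train], X[:, n_train:]
    y_train, y_rest = y[:n_train], y[n_train:]
    return X_train, y_train, X_rest, y_rest

# Made-up data: 3 features, 10 samples
X_demo = np.arange(30, dtype=float).reshape(3, 10)
y_demo = (np.arange(10) % 2)[:, np.newaxis]

X_a, y_a, X_b, y_b = split_columns(X_demo, y_demo, 6)
print(X_a.shape, y_a.shape, X_b.shape, y_b.shape)
```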

We create an object, our neural network.

In [521]:
red = RedNeuronal3()

We train our neural network.

In [526]:
red.entrenamiento(X_tr,Y_tr)

## Testing

We define the test set.

In [387]:
X_test = Z[1]
Y_test = y[25000:50000]
X_test.shape

(30, 1000)

We then run the evaluation, which returns the percentage of correct predictions. Note that the call below passes the training arrays `X_tr` and `Y_tr`.

In [528]:
red.testeo(X_tr,Y_tr)

53.0
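
To put an accuracy figure in context, it helps to compare it with always predicting the most frequent class; on a perfectly balanced binary set that baseline is 50%. A minimal sketch with made-up predictions (the helper names are hypothetical):

```python
import numpy as np

def accuracy_percent(y_pred, y_true):
    y_pred = np.asarray(y_pred).ravel()
    y_true = np.asarray(y_true).ravel()
    return 100.0 * np.mean(y_pred == y_true)

def majority_baseline_percent(y_true):
    y_true = np.asarray(y_true).ravel()
    # Share of the most frequent label
    _, counts = np.unique(y_true, return_counts=True)
    return 100.0 * counts.max() / y_true.size

y_true_demo = np.array([1, 0, 1, 0, 1, 0, 1, 0])
y_pred_demo = np.array([1, 0, 0, 0, 1, 1, 1, 0])
print(accuracy_percent(y_pred_demo, y_true_demo))
print(majority_baseline_percent(y_true_demo))
```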

<br>
<br>