# Sentiment Analysis at the Text Level: Comparative Tests
---

> **Final project for the Sistemas Computacionales course <br>
Escuela de Ingeniería de Sistemas <br>
Universidad de Los Andes <br>
Author: Jhonathan Abreu <br>**

---
<br>

This notebook generates statistics for choosing the best vectorizer and classifier to use.

<br>

## Required modules
---

The following modules are required:

*   **gensim**: for the Word2Vec model.
*   **tqdm**: a progress bar.
*   **bokeh**: for graphs.
*   **unidecode**: for removing accents.




In [2]:
!pip install gensim  # For the Word2Vec model
!pip install tqdm    # Just for using a progress bar
!pip install bokeh   # For graphs
!pip install unidecode
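To illustrate what accent removal does to the sentences before tokenization, here is a minimal sketch that uses only the standard library's `unicodedata` module. It is an approximation, not the `unidecode` package itself: `strip_accents` is a hypothetical helper name, and unlike `unidecode` it leaves symbols such as `¿` untouched and does not transliterate non-Latin scripts.

```python
import unicodedata

def strip_accents(text):
    """Remove combining diacritical marks (e.g. 'canción' -> 'cancion')."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category "Mn") and keep everything else.
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

print(strip_accents("La canción es muy buena"))
print(strip_accents("Que bello día"))
```

Note that `ñ` also loses its tilde under this approach, which matches what `unidecode` does for that letter.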



## Uploading the dataset
---

If you are using Google Colaboratory, upload the dataset with the following code. Otherwise, set the file's absolute path in the ingestion code block.

In [2]:
from google.colab import files

uploaded = files.upload()

Saving labeled_data.csv to labeled_data.csv
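The two cases (Colab upload versus a local path) can be handled in one place. The sketch below is one possible way to do it, not part of the original notebook: `in_colab` and `resolve_dataset_path` are hypothetical helper names, and the sketch assumes that `files.upload()` returns a dictionary keyed by the uploaded file names and stores the files in the working directory, as the output above suggests.

```python
import os

def in_colab():
    """Return True when the google.colab package can be imported."""
    try:
        import google.colab  # noqa: F401
        return True
    except ImportError:
        return False

def resolve_dataset_path(local_path="labeled_data.csv"):
    """Upload the dataset on Colab, otherwise use an absolute local path."""
    if in_colab():
        from google.colab import files
        uploaded = files.upload()    # opens the upload widget
        return next(iter(uploaded))  # name of the first uploaded file
    return os.path.abspath(local_path)

datasetFileName = resolve_dataset_path("labeled_data.csv")
print(datasetFileName)
```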


## Loading libraries and modules
---

In [3]:
import os
import pandas as pd

pd.set_option('display.max_colwidth', None)  # None = no truncation (pandas >= 1.0)

import numpy as np
from copy import deepcopy
from string import punctuation
from random import shuffle
import io
import csv

import gensim
from gensim.models.word2vec import Word2Vec
from gensim.utils import simple_preprocess

from tqdm import tqdm

# Download and load NLTK's stop-word list
import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.data import load
from nltk.stem import SnowballStemmer

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.manifold import TSNE
from sklearn.preprocessing import scale
from sklearn import svm

import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers.convolutional import Conv1D
from keras.callbacks import ModelCheckpoint, ReduceLROnPlateau, EarlyStopping

from unidecode import unidecode

import time

pd.options.mode.chained_assignment = None
tqdm.pandas(desc="progress-bar")
nltk.download('stopwords')

Using TensorFlow backend.


[nltk_data] Downloading package stopwords to /content/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Dataset ingestion
---

The following code loads the data from the file.

In [4]:
def ingest(datasetFileName):
    data = pd.read_csv(datasetFileName, header = None)
    data.columns = ['sentences', 'sentiment']
    data['sentiment'] = data['sentiment'].map({
                                                'positivo': 2,
                                                'neutral': 1,
                                                'negativo': 0
                                              })
    data.reset_index(drop = True, inplace = True)
    print('dataset loaded with shape', data.shape)
    return data

datasetFileName = 'labeled_data.csv'
data = ingest(datasetFileName)
display(data.head())

trainSentences, testSentences, trainLabels, testLabels = \
    train_test_split(np.array(data.sentences),
                     np.array(data.sentiment), test_size = 0.2)

dataset loaded with shape (600, 2)


| | sentences | sentiment |
|---|---|---|
| 0 | La canción que acaba de sacar Sia es muy buena, suena muy bien | 2 |
| 1 | La música de Maluma es realmente mala y sin sentido, no se como es que a las persona les gusta | 0 |
| 2 | Hoy es un día normal y tranquilo | 1 |
| 3 | Que bello día | 2 |
| 4 | Estoy enojado contigo | 0 |
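With pandas, `Series.map` turns any label that is missing from the dictionary into `NaN` without raising an error, so a misspelled label can slip through unnoticed. The sketch below shows one way to check the file before training, using only the standard library. `read_labeled_rows` is a hypothetical helper, and the sample rows are illustrative (the third one carries a deliberately misspelled label).

```python
import csv
import io

# Same label encoding as the ingest function.
LABEL_IDS = {"positivo": 2, "neutral": 1, "negativo": 0}

def read_labeled_rows(handle):
    """Return (rows, rejected): mapped rows and rows that cannot be mapped."""
    rows, rejected = [], []
    for line_number, record in enumerate(csv.reader(handle), start=1):
        if len(record) != 2:
            rejected.append((line_number, record))
            continue
        sentence, label = record
        label_id = LABEL_IDS.get(label)
        if label_id is None:
            rejected.append((line_number, record))
        else:
            rows.append((sentence, label_id))
    return rows, rejected

sample = io.StringIO(
    '"La canción que acaba de sacar Sia es muy buena, suena muy bien",positivo\n'
    "Hoy es un día normal y tranquilo,neutral\n"
    "Estoy enojado contigo,negativa\n"  # misspelled label on purpose
)
rows, rejected = read_labeled_rows(sample)
print(len(rows), "rows kept;", len(rejected), "rejected")
```

For a real file, the same function could be called with `open(datasetFileName, newline='', encoding='utf-8')` in place of the in-memory sample.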


## Tests
---

Several SVMs and a convolutional neural network are tested with two types of vectorizer: Word2Vec and CountVectorizer (Bag-of-Words, as proposed in the article).
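The two sentence representations differ in kind: Bag-of-Words counts how often each vocabulary term (here unigrams and bigrams) occurs, while the Word2Vec variant averages the word embeddings of a sentence, each scaled by a per-word weight. The following pure-Python sketch illustrates both ideas on a toy sentence. The function names, vocabulary, embeddings, and weights are made up for illustration and are not produced by the notebook.

```python
from collections import Counter

def bag_of_words(tokens, vocabulary):
    """Count unigrams and bigrams, returning one count per vocabulary entry."""
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    counts = Counter(tokens + bigrams)
    return [counts[term] for term in vocabulary]

def weighted_mean_vector(tokens, word_vectors, weights, size):
    """Average the vectors of known tokens, each scaled by its weight."""
    total = [0.0] * size
    known = 0
    for token in tokens:
        if token not in word_vectors or token not in weights:
            continue  # out-of-vocabulary tokens are skipped
        for i, value in enumerate(word_vectors[token]):
            total[i] += value * weights[token]
        known += 1
    # Divide by the number of known tokens; an all-zero vector if none is known.
    return [value / known for value in total] if known else total

tokens = ["que", "bello", "dia"]

# Hypothetical vocabulary, embeddings and weights, for illustration only.
vocabulary = ["bello", "dia", "bello dia", "malo"]
word_vectors = {"bello": [1.0, 0.0], "dia": [0.0, 1.0]}
weights = {"bello": 2.0, "dia": 1.0}

print(bag_of_words(tokens, vocabulary))                             # [1, 1, 1, 0]
print(weighted_mean_vector(tokens, word_vectors, weights, size=2))  # [1.0, 0.5]
```

The Bag-of-Words vector grows with the vocabulary, whereas the averaged vector always has the embedding dimension, which is why the Word2Vec tests sweep several dimensions.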

In [0]:
################################################################################
## DATASET CLEANING FUNCTIONS ##################################################
################################################################################

nltk.download('punkt')

nltk.download('stopwords')
spanishStopWords = stopwords.words('spanish')

# Load and extend the list of punctuation marks and other symbols.
from string import punctuation

nonWords = list(punctuation)
nonWords.extend(['¿', '¡'])  # Add these symbols (Spanish)
nonWords.extend(map(str, range(10)))  # Add the numeric digits

# Stemmer: the object that reduces words to their stems
stemmer = SnowballStemmer('spanish')

# Function that applies stemming
def stem_tokens(tokens, stemmer):
    stemmedTokens = []
    for token in tokens:
        stemmedTokens.append(stemmer.stem(token))
        
    return stemmedTokens

# Function that cleans and tokenizes the sentences
def tokenize(text):
    # Remove symbols and digits
    text = ''.join([c for c in text if c not in nonWords])
    # Tokenization
    tokens = word_tokenize(text)

    # Stemming
    try:
        stemmedTokens = stem_tokens(tokens, stemmer)
    except Exception as e:
        print(e)
        print(text)
        stemmedTokens = ['']
        
    return stemmedTokens

################################################################################



################################################################################
## TESTS WITH WORD2VEC #########################################################
################################################################################

comparisonResults = None

# Word2Vec

def buildSentenceVector(tokens, size, wordsModel):
    vec = np.zeros(size).reshape((1, size))
    count = 0
        
    for word in tokens:
        try:
            vec += wordsModel.wv[word].reshape((1, size)) * tfidf[word]
            count += 1.
        except KeyError: # the token is missing from the Word2Vec vocabulary
                         # or from the TF-IDF table (e.g. unseen test words)
            continue
    if count != 0:
        vec /= count
    return vec

def compareSVMClassifiers(classifiers, xTrain, yTrain, xTest, yTest,
                          verbose = False):
    nClassifiers = len(classifiers.keys())
    columnNames = ['Clasificador', 'Exactitud Entrenamiento',
                   'Exactitud Prueba']
    results = pd.DataFrame(data = np.zeros(shape = (nClassifiers, 3)),
                           columns = columnNames)
    counter = 0
    for key, classifier in classifiers.items():
        tic = time.perf_counter()  # time.clock() was removed in Python 3.8
        classifier.fit(xTrain, yTrain)
        toc = time.perf_counter()
        elapsedTime = toc - tic
        trainAccuracy = classifier.score(xTrain, yTrain)
        testAccuracy = classifier.score(xTest, yTest)
        results.loc[counter, 'Clasificador'] = key
        results.loc[counter, 'Exactitud Entrenamiento'] = trainAccuracy
        results.loc[counter, 'Exactitud Prueba'] = testAccuracy
        if verbose:
            print("{c} entrenado en {f:.2f} s".format(c = key, f = elapsedTime))
        counter += 1
        
    return results

print(('\n\n*****************************************************************\n'
       'INICIANDO PRUEBAS CON WORD2VEC\n'
       '*****************************************************************\n'
       '\n'))

vectorDimensions = [200, 500, 800, 1000]

for vectorDimension in vectorDimensions:
    print(('\t***************************************************************\n'
           '\tW2V {}\n'
           '\t***************************************************************\n'
           '\t\n').format(vectorDimension))
    
    svmsWithW2V = {}
    
    trainSentencesTokens = [sentence.lower() for sentence in trainSentences]
    trainSentencesTokens = [unidecode(sentence)
                            for sentence in trainSentencesTokens]
    trainSentencesTokens = [tokenize(sentence)
                            for sentence in trainSentencesTokens]
    
    testSentencesTokens = [sentence.lower() for sentence in testSentences]
    testSentencesTokens = [unidecode(sentence)
                            for sentence in testSentencesTokens]
    testSentencesTokens = [tokenize(sentence)
                            for sentence in testSentencesTokens]
    
    wordsModel = Word2Vec(trainSentencesTokens, size = vectorDimension,
                          min_count = 3, window = 5)
    wordsModel.train([train for train in tqdm(trainSentencesTokens)],
                     total_examples = len(trainSentencesTokens), epochs = 15)

    vectorizer = TfidfVectorizer(analyzer = lambda x: x)
    matrix = vectorizer.fit_transform([sentence
                                       for sentence in tqdm(trainSentencesTokens)])
    tfidf = dict(zip(vectorizer.get_feature_names(), vectorizer.idf_))
    
    trainSentencesVectors = np.concatenate([buildSentenceVector(w, vectorDimension,
                                                                wordsModel)
                                            for w in tqdm(trainSentencesTokens)])
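    # Fit the scaling on the training vectors only and reuse it for the test
    # vectors, so both sets go through the same transformation.
    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler().fit(trainSentencesVectors)
    trainSentencesVectors = scaler.transform(trainSentencesVectors)

    testSentencesVectors = np.concatenate([buildSentenceVector(w, vectorDimension,
                                                               wordsModel)
                                           for w in tqdm(testSentencesTokens)])
    testSentencesVectors = scaler.transform(testSentencesVectors)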

    # Linear SVM
    print(('\n\t\t*************************************************************\n'
           '\t\tSVM Lineal\n'
           '\t\t*************************************************************\n'
           '\n'))
    parameters = [
        {
            'kernel': ['linear'],
            'C': [0.01, 0.1, 1, 10, 100]
        }
    ]
    optimalLinearSVM = GridSearchCV(svm.SVC(decision_function_shape = 'ovr'),
                                parameters, cv = 3, n_jobs = -1, verbose = 0)
    optimalLinearSVM.fit(trainSentencesVectors, trainLabels)
    
    # Radial (RBF) SVM
    print(('\t\t*************************************************************\n'
           '\t\tSVM Radial\n'
           '\t\t*************************************************************\n'
           '\n'))
    parameters = [
        {
            'kernel': ['rbf'],
            'gamma': [1e-4, 1e-3, 1e-2, 1e-1],
            'C': [0.01, 0.1, 1, 10, 100]
        }
    ]
    optimalRadialSVM = GridSearchCV(svm.SVC(decision_function_shape = 'ovr'),
                                parameters, cv = 3, n_jobs = -1, verbose = 0)
    optimalRadialSVM.fit(trainSentencesVectors, trainLabels)
    
    # Polynomial SVM
    print(('\t\t*************************************************************\n'
           '\t\tSVM Polinómico\n'
           '\t\t*************************************************************\n'
           '\n'))
    parameters = [
        {
            'kernel': ['poly'],
            'gamma': [1e-4, 1e-3, 1e-2, 1e-1],
            'C': [0.01, 0.1, 1, 10, 100],
            'degree': [2, 3]
        }
    ]
    optimalPolySVM = GridSearchCV(svm.SVC(decision_function_shape = 'ovr'),
                              parameters, cv = 3, n_jobs = -1, verbose = 0)
    optimalPolySVM.fit(trainSentencesVectors, trainLabels)
    
    # Comparison
    svmsWithW2V['SVM Lineal con W2V {}'.format(vectorDimension)] = \
        svm.SVC(kernel = optimalLinearSVM.best_params_['kernel'],
                C = optimalLinearSVM.best_params_['C'])
        
    svmsWithW2V['SVM Radial con W2V {}'.format(vectorDimension)] = \
        svm.SVC(kernel = optimalRadialSVM.best_params_['kernel'],
                C = optimalRadialSVM.best_params_['C'],
                gamma = optimalRadialSVM.best_params_['gamma'])
        
    svmsWithW2V['SVM Polinómico con W2V {}'.format(vectorDimension)] = \
        svm.SVC(kernel = optimalPolySVM.best_params_['kernel'],
                C = optimalPolySVM.best_params_['C'],
                gamma = optimalPolySVM.best_params_['gamma'],
                degree = optimalPolySVM.best_params_['degree'])
    
    
    results = compareSVMClassifiers(svmsWithW2V,
                                    trainSentencesVectors, trainLabels,
                                    testSentencesVectors, testLabels)
    if comparisonResults is None:
        comparisonResults = results.copy()
    else:
        comparisonResults = pd.concat([comparisonResults, results])
        
    
    # Convolutional neural networks
    
    print(('\t\t*************************************************************\n'
           '\t\tRNC\n'
           '\t\t*************************************************************\n'
           '\n'))
    
    # Reshape the vectors so they can be fed to the CNN
    trainVectors = trainSentencesVectors.reshape((trainSentencesVectors.shape[0], 1,
                                         trainSentencesVectors.shape[1]))
    testVectors = testSentencesVectors.reshape((testSentencesVectors.shape[0], 1,
                                       testSentencesVectors.shape[1]))
    
    # Convert the labels to one-hot vectors
    trainLabelsCat = keras.utils.to_categorical(trainLabels, 3)
    testLabelsCat = keras.utils.to_categorical(testLabels, 3)
    
    def buildModel(inputDimension):
        model = Sequential()

        model.add(Conv1D(128, 1, input_shape = (1, inputDimension),
                         activation = 'relu'))
        model.add(Conv1D(32, 1, activation = 'relu'))

        model.add(Flatten())

        model.add(Dropout(0.5))

        model.add(Dense(64, activation = 'relu'))
        model.add(Dense(32, activation = 'relu'))    
        model.add(Dense(16, activation = 'relu'))
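        # Softmax with categorical cross-entropy suits three mutually
        # exclusive classes and makes 'accuracy' count whole predictions.
        model.add(Dense(3, activation = 'softmax'))

        model.compile(optimizer = 'rmsprop',
                      loss = 'categorical_crossentropy',
                      metrics = ['accuracy'])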

        return model
    
    modelFileName = 'modelo.hdf5'
    
    # Checkpoint: save the weights of the best epoch so far
    checkpointer = ModelCheckpoint(filepath = modelFileName, 
                                   monitor = 'val_acc', verbose = 0,
                                   save_best_only = True, mode = 'auto')

    # Learning-rate reduction on plateau. Note: min_lr equals RMSprop's default
    # learning rate (0.001), so the rate cannot actually drop below its start.
    learningRateReducer = ReduceLROnPlateau(monitor = 'val_acc', factor = 0.2,
                                            patience = 5, min_lr = 0.001,
                                            verbose = 0)

    # Stop training once the monitored metric stops improving
    earlyStop = EarlyStopping(monitor = 'val_acc', patience = 50, verbose = 0,
                              mode = 'auto')
    
    model = buildModel(trainVectors.shape[2])
    model.fit(trainVectors, trainLabelsCat, batch_size = 32,
               validation_data = (testVectors, testLabelsCat), epochs = 1000,
               callbacks = [checkpointer, learningRateReducer, earlyStop],
               verbose = 0, shuffle = False)
    
    # Rebuild the model
    model = buildModel(trainVectors.shape[2])

    # Load the best weights saved by the checkpoint
    model.load_weights(modelFileName)
    acc = model.evaluate(trainVectors, trainLabelsCat, verbose = 0)[1]
    valacc = model.evaluate(testVectors, testLabelsCat, verbose = 0)[1]
    columnNames = ['Clasificador', 'Exactitud Entrenamiento',
                   'Exactitud Prueba']
    results = pd.DataFrame(data = np.zeros(shape = (1, 3)),
                           columns = columnNames)
    results.loc[0, 'Clasificador'] = 'RNC con W2V {}'.format(vectorDimension)
    results.loc[0, 'Exactitud Entrenamiento'] = acc
    results.loc[0, 'Exactitud Prueba'] = valacc
        
    comparisonResults = pd.concat([comparisonResults, results])
    
    

################################################################################



################################################################################
## TESTS WITH COUNTVECTORIZER ##################################################
################################################################################

print(('\n\n*****************************************************************\n'
       'INICIANDO PRUEBAS CON CountVectorizer\n'
       '\n\n*****************************************************************\n'
       '\n'))

vectorizer = CountVectorizer(analyzer = 'word', tokenizer = tokenize,
                             lowercase = True, strip_accents = 'unicode',
                             stop_words = spanishStopWords, ngram_range = (1,2))

# Fit the vocabulary
vectorizer.fit(trainSentences)

# Sentences to vectors
trainVectors = vectorizer.transform(trainSentences).toarray()
testVectors = vectorizer.transform(testSentences).toarray()

vectorDimension = trainVectors.shape[1]

# Linear SVM

print(('\n\t***************************************************************\n'
       '\tSVM Lineal\n'
       '\t***************************************************************\n'
       '\n'))
parameters = [
    {
        'kernel': ['linear'],
        'C': [0.01, 0.1, 1, 10, 100]
    }
]
optimalLinearSVM = GridSearchCV(svm.SVC(decision_function_shape = 'ovr'),
                                parameters, cv = 3, n_jobs = -1, verbose = 0)
optimalLinearSVM.fit(trainVectors, trainLabels)  # CountVectorizer vectors
    
# Radial (RBF) SVM

print(('\t***************************************************************\n'
       '\tSVM Radial\n'
       '\t***************************************************************\n'
       '\n'))
parameters = [
    {
        'kernel': ['rbf'],
        'gamma': [1e-4, 1e-3, 1e-2, 1e-1],
        'C': [0.01, 0.1, 1, 10, 100]
    }
]
optimalRadialSVM = GridSearchCV(svm.SVC(decision_function_shape = 'ovr'),
                                parameters, cv = 3, n_jobs = -1, verbose = 0)
optimalRadialSVM.fit(trainVectors, trainLabels)  # CountVectorizer vectors
    
# Polynomial SVM
print(('\t***************************************************************\n'
       '\tSVM Polinómico\n'
       '\t***************************************************************\n'
       '\n'))
parameters = [
    {
        'kernel': ['poly'],
        'gamma': [1e-4, 1e-3, 1e-2, 1e-1],
        'C': [0.01, 0.1, 1, 10, 100],
        'degree': [2, 3]
    }
]
optimalPolySVM = GridSearchCV(svm.SVC(decision_function_shape = 'ovr'),
                              parameters, cv = 3, n_jobs = -1, verbose = 0)
optimalPolySVM.fit(trainVectors, trainLabels)  # CountVectorizer vectors
        
# Comparison

svmsWithW2V = {}

svmsWithW2V['SVM Lineal con CountVectorizer'] = \
    svm.SVC(kernel = optimalLinearSVM.best_params_['kernel'],
            C = optimalLinearSVM.best_params_['C'])
        
svmsWithW2V['SVM Radial con CountVectorizer'] = \
    svm.SVC(kernel = optimalRadialSVM.best_params_['kernel'],
            C = optimalRadialSVM.best_params_['C'],
            gamma = optimalRadialSVM.best_params_['gamma'])
        
svmsWithW2V['SVM Polinómico con CountVectorizer'] = \
    svm.SVC(kernel = optimalPolySVM.best_params_['kernel'],
            C = optimalPolySVM.best_params_['C'],
            gamma = optimalPolySVM.best_params_['gamma'],
            degree = optimalPolySVM.best_params_['degree'])
    
    
results = compareSVMClassifiers(svmsWithW2V,
                                trainVectors, trainLabels,
                                testVectors, testLabels)

comparisonResults = pd.concat([comparisonResults, results])
        
    
# Convolutional neural networks

print(('\t***************************************************************\n'
       '\tRNC\n'
       '\t***************************************************************\n'
       '\n'))
    
# Reshape the CountVectorizer vectors so they can be fed to the CNN
trainVectors = trainVectors.reshape((trainVectors.shape[0], 1,
                                     trainVectors.shape[1]))
testVectors = testVectors.reshape((testVectors.shape[0], 1,
                                   testVectors.shape[1]))
    
# Convert the labels to one-hot vectors
trainLabelsCat = keras.utils.to_categorical(trainLabels, 3)
testLabelsCat = keras.utils.to_categorical(testLabels, 3)
    
def buildModel(inputDimension):
    model = Sequential()

    model.add(Conv1D(128, 1, input_shape = (1, inputDimension),
              activation = 'relu'))
    model.add(Conv1D(32, 1, activation = 'relu'))

    model.add(Flatten())

    model.add(Dropout(0.5))

    model.add(Dense(64, activation = 'relu'))
    model.add(Dense(32, activation = 'relu'))    
    model.add(Dense(16, activation = 'relu'))
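    # Softmax with categorical cross-entropy suits three mutually
    # exclusive classes and makes 'accuracy' count whole predictions.
    model.add(Dense(3, activation = 'softmax'))

    model.compile(optimizer = 'rmsprop',
                  loss = 'categorical_crossentropy',
                  metrics = ['accuracy'])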

    return model
    
modelFileName = 'modelo.hdf5'
    
# Checkpoint 
checkpointer = ModelCheckpoint(filepath = modelFileName, 
                               monitor = 'val_acc', verbose = 0,
                               save_best_only = True, mode = 'auto')

# Learning-rate reduction on plateau
learningRateReducer = ReduceLROnPlateau(monitor = 'val_acc', factor = 0.2,
                                        patience = 5, min_lr = 0.001,
                                        verbose = 0)

# Stop training once the monitored metric stops improving
earlyStop = EarlyStopping(monitor = 'val_acc', patience = 50, verbose = 0,
                          mode = 'auto')
    
model = buildModel(trainVectors.shape[2])
model.fit(trainVectors, trainLabelsCat, batch_size = 32,
          validation_data = (testVectors, testLabelsCat), epochs = 1000,
          callbacks = [checkpointer, learningRateReducer, earlyStop],
          verbose = 0, shuffle = False)
    
# Rebuild the model
model = buildModel(trainVectors.shape[2])

# Load the best weights saved by the checkpoint
model.load_weights(modelFileName)
acc = model.evaluate(trainVectors, trainLabelsCat, verbose = 0)[1]
valacc = model.evaluate(testVectors, testLabelsCat, verbose = 0)[1]
columnNames = ['Clasificador', 'Exactitud Entrenamiento',
               'Exactitud Prueba']
results = pd.DataFrame(data = np.zeros(shape = (1, 3)),
                       columns = columnNames)
results.loc[0, 'Clasificador'] = 'RNC con CountVectorizer'
results.loc[0, 'Exactitud Entrenamiento'] = acc
results.loc[0, 'Exactitud Prueba'] = valacc
    
comparisonResults = pd.concat([comparisonResults, results])

    

################################################################################



################################################################################
## SHOWING THE RESULTS #########################################################
################################################################################

display(comparisonResults.sort_values(by = 'Exactitud Prueba',
                                      ascending = False))

[nltk_data] Downloading package punkt to /content/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /content/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


*****************************************************************
INICIANDO PRUEBAS CON WORD2VEC
*****************************************************************


	***************************************************************
	W2V 200
	***************************************************************
	



100%|██████████| 480/480 [00:00<00:00, 814427.96it/s]
100%|██████████| 480/480 [00:00<00:00, 148976.31it/s]
100%|██████████| 480/480 [00:00<00:00, 4481.73it/s]
100%|██████████| 120/120 [00:00<00:00, 4481.89it/s]



		*************************************************************
		SVM Lineal
		*************************************************************


		*************************************************************
		SVM Radial
		*************************************************************


		*************************************************************
		SVM Polinómico
		*************************************************************


		*************************************************************
		RNC
		*************************************************************


