# Sentiment Analysis with product reviews from Amazon España

This exercise takes over 700,000 Amazon product reviews and trains a model to perform sentiment analysis on comments, determining whether they are positive or negative.

+ The number of stars given to a product determines whether the comment is positive or negative. We are going to take 4 and 5 star ratings as positive and 3 and below as negative.

We are going to use a bag of words approach, which means that we count the number of times each word appears in each comment, and those counts are our input variables or features. Our output is a binary variable (0 or 1) stating whether the comment is negative or positive.
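As a minimal sketch of the bag of words idea, the snippet below counts words in two toy comments using only the standard library. The comments and the variable names are made up for illustration and are not taken from the dataset; the notebook itself uses scikit-learn for this step.

```python
from collections import Counter

# Two toy comments (hypothetical examples, not taken from the dataset)
comments = [
    "muy bueno muy recomendable",
    "no es bueno",
]

# Count how many times each word appears in each comment
counts = [Counter(comment.split()) for comment in comments]

# The vocabulary is the sorted set of all words seen in any comment
vocabulary = sorted(set(word for c in counts for word in c))

# One row per comment, one column per vocabulary word
matrix = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
print(matrix)
```

Each row of `matrix` is the feature vector of one comment; the binary target would be a separate list with one 0 or 1 per comment.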

# We will follow these steps:

1. Read the dataset
2. Divide the sample in training and test
3. Apply "HashingVectorizer" to the training set, which counts the times a word appears in each comment. For each comment, a numerical vector is created with the number of times each word appears.
The vectorizer includes data preparation:
- A function that cleans and prepares the data with stemming in Spanish.
- I initially used a stop-word removal function, but the model performed better without it.
4. Apply and compare different classification models, using cross-validation. I have chosen 2 types of models: Logistic Regression and LinearSVC. We are going to compare the performance of the models on the test dataset. At the end, the best model will be trained with all the data (training and test).
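The vectorizer in step 3 relies on the hashing trick: instead of building a vocabulary, each token is hashed to one of a fixed number of columns. The sketch below illustrates the idea with `hashlib.md5`, which is an assumption made for illustration only; scikit-learn uses a different hash function and, with its default settings, also normalizes each row, so its output will not match these raw counts. The names `bucket` and `hashed_counts` are hypothetical.

```python
import hashlib

N_FEATURES = 8  # tiny on purpose; the notebook uses 10000 columns


def bucket(token, n_features=N_FEATURES):
    """Map a token to a column index with a stable hash (illustrative choice)."""
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_features


def hashed_counts(tokens, n_features=N_FEATURES):
    """Build a fixed-length count vector without storing a vocabulary."""
    vector = [0] * n_features
    for token in tokens:
        vector[bucket(token, n_features)] += 1
    return vector


vector = hashed_counts(["muy", "bueno", "muy", "recomendable"])
print(vector)
```

Because the number of columns is fixed, different tokens can collide in the same column; a larger number of columns makes collisions rarer at the cost of memory.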


In [1]:
#1. Reading the data

import os

current_path = os.path.abspath(os.curdir)

print("The current Python path is " + current_path)

The current Python path is /Users/manena/PYTHON DATAHACK/Practicas/Practicas_Maria_Elena_Martinez


In [3]:
import zipfile
# The file must be in the current directory
zip_file = open("amazon_es_reviews.csv.zip", "rb")

inst_zip_file = zipfile.ZipFile(zip_file)
inst_zip_file.printdir()
inst_zip_file.extractall(current_path)


zip_file.close()

File Name                                             Modified             Size
amazon_es_reviews.csv                          2016-06-08 01:52:20    194495085


In [4]:
import pandas as pd
df_reviews = pd.DataFrame()
df_reviews = pd.read_csv("amazon_es_reviews.csv", sep=";" )


In [5]:
df_reviews.head(10)

Unnamed: 0,comentario,estrellas
0,"Para chicas es perfecto, ya que la esfera no e...",4.0
1,Muy floja la cuerda y el anclaje es de mala ca...,1.0
2,"Razonablemente bien escrito, bien ambientado, ...",3.0
3,Hola! No suel o escribir muchas opiniones sobr...,5.0
4,A simple vista m parecia una buena camara pero...,1.0
5,"NI para pasar el rato, los personajes no tiene...",1.0
6,el fabricante decia que es compatible con la d...,2.0
7,"el libro está en muy buenas condiciones, pero ...",3.0
8,"buen aspecto, pero le falta fortaleza. util pa...",3.0
9,Explica de forma simple y sencilla los pensami...,5.0


In [6]:
import nltk
#nltk.download_shell() # Uncomment this line if you have never downloaded the NLTK data
from nltk.corpus import stopwords # Import the stop word list


In [7]:
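# 2. Split the sample into training and test sets
# (train_test_split lives in sklearn.model_selection, the same module used for GridSearchCV below)
from sklearn.model_selection import train_test_split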
train, test = train_test_split(df_reviews,
                              train_size=0.8,
                              test_size=0.2)




We will apply a Stemmer in Spanish

In [8]:
from nltk.tokenize import RegexpTokenizer

my_tokenizer = RegexpTokenizer(r"[\w']+")  # raw string, so \w is not treated as a string escape


# Our stemmer...
from nltk.stem.snowball import SpanishStemmer

stemmer_castellano = SpanishStemmer()


# The function that performs tokenization and stemming...
def tokenizer_stemmer(document):
    return [stemmer_castellano.stem(token) for token in my_tokenizer.tokenize(document)]
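To see what the tokenizer pattern alone produces, here is a small sketch that applies the same regular expression with the standard `re` module. It leaves out the stemming step, and `simple_tokenize` is a hypothetical helper name used only for this illustration.

```python
import re

# Same pattern as my_tokenizer: runs of word characters or apostrophes
TOKEN_PATTERN = re.compile(r"[\w']+")


def simple_tokenize(document):
    """Return the tokens matched by the pattern, dropping punctuation and spaces."""
    return TOKEN_PATTERN.findall(document)


tokens = simple_tokenize("Hola! No suelo escribir muchas opiniones.")
print(tokens)
# ['Hola', 'No', 'suelo', 'escribir', 'muchas', 'opiniones']
```

In the notebook, each of these tokens is then passed through the Spanish stemmer before it is counted.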


In [9]:
from sklearn.feature_extraction.text import HashingVectorizer

# We initialize the "HashingVectorizer", a scikit-learn tool for "bag of words".
vectorizer = HashingVectorizer(analyzer = "word",   # word-level analysis
                             tokenizer = tokenizer_stemmer,# We prepare the data with the function we've just defined
                             preprocessor = None, 
                           #  stop_words = stopwords.words("spanish"),# we no longer eliminate stopwords
                             n_features = 10000, # Number of hashed feature columns = 10000
                             strip_accents='ascii', # we eliminate Spanish accents
                             encoding = 'utf-8',
                             ngram_range = (1,3)) # We use individual words, bigrams and trigrams
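The `ngram_range = (1,3)` setting means that single words, pairs and triples of consecutive words all become features. A minimal sketch of that expansion in plain Python follows; `word_ngrams` is a hypothetical helper written for this illustration and is not how scikit-learn implements it internally.

```python
def word_ngrams(tokens, min_n=1, max_n=3):
    """Return all n-grams of consecutive tokens for n in [min_n, max_n]."""
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


features = word_ngrams(["no", "lo", "recomiendo"])
print(features)
# ['no', 'lo', 'recomiendo', 'no lo', 'lo recomiendo', 'no lo recomiendo']
```

Bigrams and trigrams let the model see negations such as "no lo recomiendo" as a unit, which single-word counts cannot capture.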




In [10]:
# We obtain the bag of words of the training dataset
train_data_features = vectorizer.fit_transform(pd.DataFrame(train).loc[:,"comentario"])
# (this will take long)


In [11]:
# The target variable is 0 if the comment has 3 or fewer stars and 1 if it has 4 or 5 stars
# (0: negative comment, 1: positive comment)
train_target = [1 if x > 3 else 0 for x in train["estrellas"] ]

In [12]:
# Let's have an idea of the dimensions of the results
print(train_data_features.shape)

(561956, 10000)


In [13]:
# We make a pipeline to enter it in the GridSearchCV and compare the model with different parameters
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import GridSearchCV

log_regression = LogisticRegressionCV()

# We generate the Pipeline
log_pipeline = Pipeline(steps=[("regresion",log_regression)])

log_pipeline


Pipeline(steps=[('regresion', LogisticRegressionCV(Cs=10, class_weight=None, cv=None, dual=False,
           fit_intercept=True, intercept_scaling=1.0, max_iter=100,
           multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
           refit=True, scoring=None, solver='lbfgs', tol=0.0001, verbose=0))])

In [14]:
# Define the hyperparameters (very simple because of memory issues)
hiperparam_pipeline_log_reg = {  
   "regresion__fit_intercept":[True], # Data is not centered
   "regresion__cv":[10, 20],
   "regresion__max_iter": [500, 600]
    }
                             
    


In [15]:
log_grid_search = GridSearchCV(estimator=log_pipeline,
                              param_grid=hiperparam_pipeline_log_reg,
                              scoring="roc_auc",
                              cv=10,
                              n_jobs=4
                             )
        

#We train the model
log_grid_search.fit( train_data_features, 
                    train_target)

#This will take a very long time

GridSearchCV(cv=10, error_score='raise',
       estimator=Pipeline(steps=[('regresion', LogisticRegressionCV(Cs=10, class_weight=None, cv=None, dual=False,
           fit_intercept=True, intercept_scaling=1.0, max_iter=100,
           multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
           refit=True, scoring=None, solver='lbfgs', tol=0.0001, verbose=0))]),
       fit_params={}, iid=True, n_jobs=4,
       param_grid={'regresion__fit_intercept': [True], 'regresion__cv': [10, 20], 'regresion__max_iter': [500, 600]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='roc_auc', verbose=0)

## In order to compare, we transform the test set with the vectorizer and run the model on it

In [16]:
test_data_features = vectorizer.transform(test["comentario"])


In [17]:
test_target = [1 if x > 3 else 0 for x in test["estrellas"] ]
score = log_grid_search.score(test_data_features, test_target)

In [18]:
print(score)

0.897634145407


## The score is an area under the ROC curve of 89.76%
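As a reminder of what this number means: the area under the ROC curve equals the probability that a randomly chosen positive comment receives a higher score than a randomly chosen negative one. The sketch below computes it by comparing every positive/negative pair on made-up labels and scores; it is an illustration of the definition, not the routine scikit-learn uses, and it would be far too slow for the full test set.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Toy data (hypothetical): two positive and two negative comments
labels = [1, 0, 1, 0]
scores = [0.9, 0.4, 0.3, 0.1]
auc = roc_auc(labels, scores)
print(auc)  # 0.75
```

Here three of the four positive/negative pairs are ordered correctly, so the AUC is 0.75; a value of 0.5 corresponds to random ranking and 1.0 to a perfect one.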

In [19]:
# We try the model on a few example phrases
phrases = ["es perfecta para salir a pasear",
         "lo odio",
         "lo recomiendo a mis amigos, está muy bien"]
my_vector = vectorizer.transform(phrases)
log_grid_search.predict(my_vector)

array([1, 0, 1])

In [20]:
my_vector

<3x10000 sparse matrix of type '<class 'numpy.float64'>'
	with 39 stored elements in Compressed Sparse Row format>

## Now we're going to try with a LinearSVC

In [27]:
from sklearn.svm import LinearSVC
import numpy as np
Cs = np.logspace(-2, 4, 7)

In [28]:
svcLinear = LinearSVC()
svc_pipeline = Pipeline(steps=[("svc",svcLinear)])

svc_pipeline

Pipeline(steps=[('svc', LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0))])

In [31]:
hiperparam_pipeline_svc = {  
    "svc__fit_intercept":[True], # Data is not centered
    "svc__max_iter": [600,500,400], # Trying different values, these were the best numbers of iterations
    "svc__C": Cs
    }

In [32]:
svc_grid_search = GridSearchCV(estimator=svc_pipeline,
                              param_grid=hiperparam_pipeline_svc,
                              scoring="roc_auc",
                              cv=10,
                              n_jobs=4
                             )
        

#We train the model
svc_grid_search.fit( train_data_features, 
                    train_target)
#(it takes a long time)

GridSearchCV(cv=10, error_score='raise',
       estimator=Pipeline(steps=[('svc', LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0))]),
       fit_params={}, iid=True, n_jobs=4,
       param_grid={'svc__fit_intercept': [True], 'svc__max_iter': [600, 500, 400], 'svc__C': array([  1.00000e-02,   1.00000e-01,   1.00000e+00,   1.00000e+01,
         1.00000e+02,   1.00000e+03,   1.00000e+04])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='roc_auc', verbose=0)

## We get the LinearSVC's score

In [33]:
score_svc = svc_grid_search.score(test_data_features, test_target)

In [34]:
print(score_svc)

0.897425783599


In [35]:
svc_grid_search.best_estimator_

Pipeline(steps=[('svc', LinearSVC(C=0.10000000000000001, class_weight=None, dual=True,
     fit_intercept=True, intercept_scaling=1, loss='squared_hinge',
     max_iter=500, multi_class='ovr', penalty='l2', random_state=None,
     tol=0.0001, verbose=0))])

# We choose the LinearSVC (score of 89.74%) because it trains faster and its score is very close to that of Logistic Regression

## Test the chosen model

In [36]:
phrases = ["no me gusta nada",
          "no lo recomiendo",          
          "lo usaré toda la vida"]

my_vector = vectorizer.transform(phrases)
class1 = svc_grid_search.predict(my_vector)
prediction1 = svc_grid_search.decision_function(my_vector)

class2 = log_grid_search.predict(my_vector)
prediction2 = log_grid_search.predict_proba(my_vector)

print(class1)
print(prediction1)

print(class2)
print(prediction2)


[0 0 1]
[-1.64516276 -1.5424808   0.09425527]
[0 0 1]
[[ 0.9883025   0.0116975 ]
 [ 0.98531999  0.01468001]
 [ 0.43809482  0.56190518]]
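The two models report their confidence differently: the LinearSVC returns a signed distance from the decision boundary, and the logistic regression returns class probabilities. The short sketch below re-derives the predicted classes from the values printed above, assuming the usual thresholds for binary classifiers (0 for the decision function, 0.5 for the positive-class probability).

```python
# Values copied from the output above
svc_scores = [-1.64516276, -1.5424808, 0.09425527]
logreg_positive_proba = [0.0116975, 0.01468001, 0.56190518]

# A positive decision value means class 1 (positive comment)
svc_classes = [1 if s > 0 else 0 for s in svc_scores]

# A positive-class probability above 0.5 means class 1
logreg_classes = [1 if p > 0.5 else 0 for p in logreg_positive_proba]

print(svc_classes, logreg_classes)  # [0, 0, 1] [0, 0, 1]
```

Both models agree on all three phrases, but the third one ("lo usaré toda la vida") sits close to the boundary in both cases, so it is classified as positive with low confidence.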


# We train the best model with the whole dataset

In [37]:
all_data_features = vectorizer.fit_transform(df_reviews.loc[:,"comentario"])


In [38]:
all_target = [1 if x > 3 else 0 for x in df_reviews["estrellas"] ]

In [39]:
svcLinear = LinearSVC()
svc_pipeline = Pipeline(steps=[("svc",svcLinear)])

svc_pipeline

Pipeline(steps=[('svc', LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0))])

In [41]:
hiperparam_pipeline_svc = {  
   "svc__fit_intercept":[True], # Because the data is not centered
   "svc__max_iter": [500],
   "svc__C": [0.1] 
    }

In [42]:
svc_grid_search = GridSearchCV(estimator=svc_pipeline,
                              param_grid=hiperparam_pipeline_svc,
                              scoring="roc_auc",
                              cv=10,
                              n_jobs=4
                             )
        

#We train the model
svc_grid_search.fit( all_data_features, 
                    all_target)

GridSearchCV(cv=10, error_score='raise',
       estimator=Pipeline(steps=[('svc', LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0))]),
       fit_params={}, iid=True, n_jobs=4,
       param_grid={'svc__fit_intercept': [True], 'svc__max_iter': [500], 'svc__C': [0.1]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='roc_auc', verbose=0)

## We finally store the model and the vectorizer in a pickle in order to use it on a web page

In [46]:
#Store the model in a pickle
import joblib
filename = "best_model.pkl"
with open(filename, 'wb') as fo:  
    joblib.dump(svc_grid_search, fo)

In [49]:
#We remove the tokenizer from the data preparer we are going to store for using it on pythonanywhere
#we will execute the tokenizer on the input phrase before applying the model
#(For some reason, there is an error on pythonanywhere if we try to tokenize from the pickle stored vectorizer)
vectorizer.tokenizer = None 

filename = "prepare_data_no_tokenizer.pkl"
with open(filename, 'wb') as fo:  
    joblib.dump(vectorizer, fo)
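On the web side, the tokenizer then has to run before the stored vectorizer. The following is only a sketch of how that could look, under stated assumptions: `predict_sentiment` is a hypothetical helper, and the caller is expected to pass in the stemming function, the unpickled vectorizer and the unpickled model. Note that with `tokenizer = None` the vectorizer falls back to its default token pattern, which ignores one-character tokens, so features may differ slightly from those seen during training.

```python
def predict_sentiment(phrase, tokenize_and_stem, vectorizer, model):
    """Classify one phrase as 1 (positive) or 0 (negative).

    All four arguments are supplied by the caller; in this notebook they
    would be the input text, tokenizer_stemmer, and the two objects loaded
    back from the pickle files.
    """
    # Stem outside the vectorizer, then rejoin so it receives plain text
    stemmed_text = " ".join(tokenize_and_stem(phrase))
    # The vectorizer expects an iterable of documents, hence the list
    features = vectorizer.transform([stemmed_text])
    return int(model.predict(features)[0])
```

A call might look like `predict_sentiment("lo recomiendo", tokenizer_stemmer, vectorizer, svc_grid_search)`, with the last two arguments loaded through `joblib.load`.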