# LEAD UNIVERSITY
---
* **COMMERCIAL INTELLIGENCE**
* **Professional Master's in International Trade and Markets**
* Alexander Franck, Guillermo Naranjo
* November 17, 2018

### TOPIC 3-B Quantitative information
1. Data extraction, transformation, and storage
2. Other techniques and tools for data analysis
3. Presenting information (reports, pivot tables, charts)

Adapted from a course at https://lazyprogrammer.me/.

**OBJECTIVE:** This exercise demonstrates recent techniques for extracting and processing unstructured information. In particular, it aims to give students a sense of the main stages of the process, its challenges and, above all, its opportunities.

**PROCESS STAGES:**
1. Process setup.
2. Data extraction from the web *(AMAZON REVIEWS)*.
3. Data cleaning and transformation.
4. Data storage.
5. Training a classification model using logistic regression.
6. **Sentiment analysis** of Amazon comments and ratings using NATURAL LANGUAGE PROCESSING techniques.

#### STEP 1: PREPARE THE ENVIRONMENT AND LOAD THE REQUIRED LIBRARIES

In [1]:
''' LOAD THE REQUIRED LIBRARIES '''
# FUTURE provides compatibility across Python versions
from __future__ import print_function, division
from future.utils import iteritems
from builtins import range

# NATURAL LANGUAGE TOOLKIT provides tools for processing natural language
import nltk
from nltk.stem import WordNetLemmatizer

# NUMPY is a library optimized for vector computation, among other things
import numpy as np
import pandas as pd

# SKLEARN is a library that offers tools for data mining and data analysis
from sklearn.utils import shuffle
from sklearn.linear_model import LogisticRegression

# BEAUTIFUL SOUP supports web scraping, a technique for extracting information from web pages
from bs4 import BeautifulSoup
import urllib as ul
import urllib.request  # load the submodule explicitly so ul.request is available

# OTHER UTILITY LIBRARIES
import re
import ssl
import pickle

In [2]:
''' DEFINE SOME GENERAL PARAMETERS '''
have_reviews = True
page_count = 1
all_reviews = []
max_pages = 150
word_index_map = {}
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36'
hdr = {'User-Agent': user_agent}
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
np.seterr(divide='ignore', invalid='ignore')
pd.set_option('display.max_colwidth', 500)

In [3]:
''' LOAD THE STOPWORDS (WORDS TO IGNORE) '''
wordnet_lemmatizer = WordNetLemmatizer()
stopwords = set(w.rstrip() for w in open('stopwords.txt'))

In [4]:
def save_reviews(value, file='reviews.pk'):
    with open(file, 'wb') as f:
        pickle.dump(value, f)
    
def load_reviews(file='reviews.pk'):
    with open(file, 'rb') as f:
        return(pickle.load(f))
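
As a quick check of the two helpers above, the sketch below round-trips a small list of reviews through pickle in a temporary directory. The sample data and the file name `reviews_demo.pk` are made up for illustration; the helpers mirror `save_reviews` and `load_reviews`.

```python
import os
import pickle
import tempfile


def save_reviews(value, file='reviews.pk'):
    # Serialize any Python object to disk in binary mode
    with open(file, 'wb') as f:
        pickle.dump(value, f)


def load_reviews(file='reviews.pk'):
    # Read the object back exactly as it was saved
    with open(file, 'rb') as f:
        return pickle.load(f)


# Hypothetical sample in the same [comment, label] shape the notebook uses
sample = [['Great coffee, rich and smooth', 1],
          ['Arrived past the best-by date', 0]]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'reviews_demo.pk')
    save_reviews(sample, path)
    restored = load_reviews(path)

print(restored == sample)
```

Saving the scraped list this way means the later steps can be rerun without downloading the pages again.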

#### STEP 2: EXTRACT THE REVIEWS FOR THE FOLLOWING AMAZON PRODUCT
*IMPORTANT:* The reviews are paginated.

In [5]:
url = 'https://www.amazon.com/Peets-Coffee-Aurora-Coffees-Flavors/product-reviews/B018NZJ4KK?pageNumber='

In [6]:
''' DOWNLOAD THE CONTENTS OF ONE REVIEW PAGE '''
def get_reviews(url):    
    req = ul.request.Request(url, headers=hdr)
    html = ul.request.urlopen(req, context=ctx).read()
    soup = BeautifulSoup(html,'lxml')
    soup = soup.find('div',attrs={'id':'cm_cr-review_list'})
    if soup is None:  # no review list on this page
        return []

    # Find the comments and store them in a list
    comments = []
    for comment in soup.findAll('span',class_='a-size-base review-text'):
        comments.append(comment.text)

    # Find the ratings and store them in a list (1 = more than 3 stars, 0 = otherwise)
    sentiment = []
    for rating in soup.findAll('span',class_='a-icon-alt'):    
        sentiment.append(1 if int(rating.text[0]) > 3 else 0)
    
    # Pair each comment with its rating: [[comment, sentiment], ...]
    l = [comments]+[sentiment]
    return(list(map(list, zip(*l))))
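
The labeling rule inside `get_reviews` can be tried offline. This is a minimal sketch, assuming rating strings shaped like `'5.0 out of 5 stars'`; the sample comments and ratings are made up, and `label_from_rating` is a hypothetical helper name. Ratings above 3 become 1 (positive), everything else becomes 0, and `zip` pairs each comment with its label.

```python
def label_from_rating(rating_text):
    # The first character is the star count, e.g. '4' in '4.0 out of 5 stars'
    return 1 if int(rating_text[0]) > 3 else 0


# Hypothetical scraped values, in page order
comments = ['Rich and smooth', 'Stale beans', 'It is fine']
ratings = ['5.0 out of 5 stars', '1.0 out of 5 stars', '3.0 out of 5 stars']

labels = [label_from_rating(r) for r in ratings]

# Pair each comment with its label: [[comment, label], ...]
reviews = [list(pair) for pair in zip(comments, labels)]
print(reviews)
```

Note that `zip` stops at the shorter list, so if a page yields a different number of comments and ratings, the trailing items are silently dropped.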

In [7]:
''' DOWNLOAD ALL REVIEWS, PAGE BY PAGE '''
while (have_reviews and page_count <= max_pages):    
    print('Extracting comments from review page:',page_count)
    reviews = get_reviews(url + str(page_count))    
    if len(reviews) == 0:
        have_reviews = False            
    else:
        all_reviews += reviews
        page_count += 1

save_reviews(all_reviews)

Extracting comments from review page: 1
Extracting comments from review page: 2
Extracting comments from review page: 3
Extracting comments from review page: 4
Extracting comments from review page: 5
Extracting comments from review page: 6
Extracting comments from review page: 7
Extracting comments from review page: 8
Extracting comments from review page: 9
Extracting comments from review page: 10
Extracting comments from review page: 11
Extracting comments from review page: 12
Extracting comments from review page: 13
Extracting comments from review page: 14
Extracting comments from review page: 15
Extracting comments from review page: 16
Extracting comments from review page: 17
Extracting comments from review page: 18
Extracting comments from review page: 19
Extracting comments from review page: 20
Extracting comments from review page: 21
Extracting comments from review page: 22
Extracting comments from review page: 23
Extracting comments from review page: 24
Extracting comments from 

In [8]:
all_reviews

[['For years I roasted my own beans and the only reason I don\'t do it much lately is because I just don\'t ever have the time. So you end up too often roasting a batch of beans at 6 am so you can have a morning cup. Which means you\'re grinding and perking warm beans which is not optimal.So too much of a hassle and while I still do the occasional batch to my own particular taste, I have shopped around for a reliable substitute.The first thing I discovered is that you can pay a lot of money for some really mediocre to lousy beans. But I eventually managed tofind some primo stuff and this is one I highly recommend.I\'m not going to give you all that "floral notes and finish of absinthe" nonsense.  It\'s just good coffee, dark and rich and never bitter. These guys do a good job and you ought to give it a try. Maybe you\'ll like it and maybe it won\'t be exactly what you\'re looking for but either way you won\'t feel like you paid a premium price for a bag of mud,',
  1],
 ['After all the

#### STEP 3: CLEANING AND TRANSFORMATION
* Sentences are converted into word vectors
* Very short words are removed
* Word families are generalized: *Stemming and lemmatization* https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
* Words that add little meaning are removed: stopwords
* ALL OF THIS LEADS TO A DICTIONARY OF KEYWORDS
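
The steps above can be sketched without NLTK. This is a simplified stand-in, not the notebook's tokenizer: a regular expression replaces `word_tokenize`, a crude plural-stripping rule replaces WordNet lemmatization, and the stopword set is a tiny made-up sample. The names `simple_tokenizer`, `crude_lemma` and `STOPWORDS` are hypothetical.

```python
import re

# Tiny illustrative stopword set (the notebook loads its own from stopwords.txt)
STOPWORDS = {'this', 'that', 'these', 'with', 'have', 'from'}


def crude_lemma(token):
    # Very rough stand-in for lemmatization: strip a trailing plural 's'
    if token.endswith('s') and not token.endswith('ss') and len(token) > 4:
        return token[:-1]
    return token


def simple_tokenizer(text):
    tokens = re.findall(r"[a-z]+", text.lower())      # lowercase, split on non-letters
    tokens = [t for t in tokens if len(t) > 3]        # drop very short words
    tokens = [crude_lemma(t) for t in tokens]         # generalize word families
    return [t for t in tokens if t not in STOPWORDS]  # drop stopwords


print(simple_tokenizer("These beans are not optimal.So I have shopped around"))
```

Splitting on non-letters also separates sentences that were run together in the scraped text, such as "optimal.So", into "optimal" and "so".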

In [9]:
''' CONVERT A COMMENT INTO A VECTOR OF KEYWORDS '''
def tokenizer(s):
    s = s.lower() # downcase    
    tokens = nltk.tokenize.word_tokenize(s) # split the text into words (tokens)
    tokens = [t for t in tokens if len(t) > 3] # drop words of 3 characters or fewer
    tokens = [wordnet_lemmatizer.lemmatize(t) for t in tokens] # reduce words to their base form
    tokens = [t for t in tokens if t not in stopwords] # drop words that add no value
    return tokens

In [10]:
''' BUILD A WORD-FREQUENCY VECTOR FOR ONE REVIEW, WITH ITS LABEL IN THE LAST POSITION '''
def tokens_to_vector(tokens, label):
    x = np.zeros(len(word_index_map) + 1) 
    for t in tokens:
        i = word_index_map[t]
        x[i] += 1
    x = x / x.sum() # normalize the counts so that reviews are comparable
    x[-1] = label
    return x
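
To make the vector layout concrete, here is a small worked example under assumed inputs: a three-word vocabulary and one tokenized review, both made up. Each slot holds that word's share of the review's tokens, and the final slot holds the label.

```python
import numpy as np

# Hypothetical vocabulary: word -> column index
word_index_map = {'coffee': 0, 'bitter': 1, 'rich': 2}


def tokens_to_vector(tokens, label):
    x = np.zeros(len(word_index_map) + 1)  # one slot per word, plus one for the label
    for t in tokens:
        x[word_index_map[t]] += 1          # count occurrences
    x = x / x.sum()                        # convert counts to proportions
    x[-1] = label                          # store the label in the last slot
    return x


vec = tokens_to_vector(['coffee', 'rich', 'coffee', 'coffee'], 1)
print(vec)  # proportions 0.75, 0.0, 0.25 followed by the label 1.0
```

Because the counts are divided by their total, a long review and a short review that use words in the same proportions produce the same feature values.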

In [11]:
''' BUILD A WORD DICTIONARY FROM THE TOKENS OF ALL THE COMMENTS '''
all_reviews = load_reviews()
all_reviews_ = []
current_index = 0

for review in all_reviews:
    tokens = tokenizer(review[0])    
    if tokens:
        for token in tokens:
            if token not in word_index_map:
                word_index_map[token] = current_index
                current_index += 1    
        all_reviews_.append(review)

all_reviews = all_reviews_
print("word dictionary size:",len(word_index_map))                        

word dictionary size: 2572
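
The dictionary-building loop above assigns each new token the next free integer and drops reviews that have no tokens left after cleaning. A minimal sketch with made-up, already tokenized reviews:

```python
# Hypothetical tokenized reviews (the output of a tokenizer)
tokenized_reviews = [
    ['roasted', 'bean', 'coffee'],
    ['coffee', 'bitter'],
    [],                      # a review left with no tokens after cleaning
    ['bean', 'rich'],
]

word_index_map = {}
kept = []

for tokens in tokenized_reviews:
    if not tokens:
        continue             # skip reviews with nothing left to index
    for token in tokens:
        # setdefault only inserts when the token is new;
        # len() of the dictionary is the next free index
        word_index_map.setdefault(token, len(word_index_map))
    kept.append(tokens)

print(word_index_map)
print(len(kept))
```

The index of each word later becomes its column position in the feature matrix.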


In [12]:
word_index_map

{'roasted': 0,
 'bean': 1,
 'reason': 2,
 'lately': 3,
 'time': 4,
 'roasting': 5,
 'batch': 6,
 'morning': 7,
 'mean': 8,
 'grinding': 9,
 'perking': 10,
 'warm': 11,
 'optimal.so': 12,
 'hassle': 13,
 'occasional': 14,
 'particular': 15,
 'taste': 16,
 'shopped': 17,
 'reliable': 18,
 'substitute.the': 19,
 'discovered': 20,
 'money': 21,
 'mediocre': 22,
 'lousy': 23,
 'eventually': 24,
 'managed': 25,
 'tofind': 26,
 'primo': 27,
 'stuff': 28,
 'this': 29,
 'highly': 30,
 'recommend.i': 31,
 'floral': 32,
 'note': 33,
 'finish': 34,
 'absinthe': 35,
 'nonsense': 36,
 'coffee': 37,
 'dark': 38,
 'rich': 39,
 'bitter': 40,
 'guy': 41,
 'ought': 42,
 'maybe': 43,
 'exactly': 44,
 'looking': 45,
 'feel': 46,
 'paid': 47,
 'premium': 48,
 'price': 49,
 'mind': 50,
 'book': 51,
 'history': 52,
 'starbucks': 53,
 'mention': 54,
 'model': 55,
 'product': 56,
 'peet': 57,
 'care': 58,
 'flavor': 59,
 'complexity': 60,
 'etc.': 61,
 'wrong': 62,
 'welllllll': 63,
 'local': 64,
 'house': 65,


In [13]:
N = len(all_reviews)
data = np.zeros((N, len(word_index_map) + 1)) # matrix of shape N reviews x (D words + 1 label)
i = 0
            
for review in all_reviews:
    tokens = tokenizer(review[0])
    xy = tokens_to_vector(tokens,review[1])
    data[i,:] = xy
    i += 1    

#### STEP 4: STORING THE INFORMATION

In [14]:
df_reviews = pd.DataFrame(all_reviews).reset_index()
df_reviews.columns = ['Index','Comment','Positive']

In [15]:
#SOME POSITIVE REVIEWS
df_reviews[df_reviews.Positive == 1].head()

Unnamed: 0,Index,Comment,Positive
0,0,"For years I roasted my own beans and the only reason I don't do it much lately is because I just don't ever have the time. So you end up too often roasting a batch of beans at 6 am so you can have a morning cup. Which means you're grinding and perking warm beans which is not optimal.So too much of a hassle and while I still do the occasional batch to my own particular taste, I have shopped around for a reliable substitute.The first thing I discovered is that you can pay a lot of money for so...",1
1,1,"After all these years, still the best coffee you can get. Keep in mind that even the book about the history of Starbucks has a mention of how they wanted to model their product after Peet's... and Peet's just cares so much about flavor, roasting, complexity, etc., etc. that you can't go wrong. You might say ""welllllll, my local coffee house does an amazing job"", but you know what, most local coffee roasters still don't have the expertise or the equipment, or the scientists and artists behi...",1
2,2,"This is a good, strong coffee. It's not the original Major Dickason's Blend that put Peet's on the map in the late 1980's though. Back then the beans were quite rich and a real treat. As with any coffee bean over the years, crops change, so beans are not quite the same. I have found this to be true when ordering any of my favorites directly from Peet's over the past few years or via Amazon.I wouldn't call the newest incarnation of this blend bitter, but it is a tiny bit acidic. To balance th...",1
3,3,"At times, making a good espresso seems to be a fleeting event. I do find with Peet’s Major Dickason’s whole coffee beans, my results are more reliable and consistent.While I drink several other beans and constantly try others, Peet's Major Dickason's Whole Beans are my go to coffee several days a week.I am a dedicated espresso drinker, but it has taken me years to make a good espresso at home. Saying ""good"" however, is a matter of opinion. I have had good espressos at some cafes and restaura...",1
5,5,"This is a review for the addon listing of Peet's Major Dickerson blend.I've been a Peet's convert for a while now, in fact Peet's ruined my preference for medium roasts with it's Maj D blend and I was quite a coffee snob willing to spend more for quality at other website roasters typically ordering around 7 lbs a month. But when I discovered Maj D I was surprised at how rich, complex and satisfying of a roast it was. For a while I ordered from their website but I'd check in with amazon fro...",1


In [17]:
#SOME NEGATIVE REVIEWS
df_reviews[df_reviews.Positive == 0].head()

Unnamed: 0,Index,Comment,Positive
4,4,Disappointed. Received the package 12/20 that was roasted 9/18 and best by 12/17 so past the best by date. Usually when ordering via Amazon things are much fresher. I won't be ordering this product again. Which is a shame because it is quite good. Usually you get a discount for out of date products not a full price tag.,0
9,9,"UPDATED April 2018: Peet's stop making the decaf house blend over 6 months ago. On February I saw it back here and I purchased 4 of them thinking that after it was discontinue they brought it back, big mistake, this is nothing like the old decaf house blend, it is bitter and the flavor is very bad. I wasted my money buying these. Not sure why they even call it the same if it is different coffee taste. I trashed all of them.By far one of the best decaf coffees out there. It is better if you ...",0
14,14,"Made my usual cold brew with my Toddy, and was recommended to try this by a friend. Typically I'll spend a few extra dollars for some of my favorite dark roast coffee beans, but wanted to see if I could save some money while still having a great cold brew. Unfortunately, I found this to be unusually bitter and strong with my regular 12 hour process. Perhaps this is best left as a hot brewed coffee, as there's just no body or discernible flavor notes to it.",0
20,20,Delivered 10/5/17. Product 'best purchased by' 10/11/17. Not the first time this has happened; be careful when purchasing. Always check dates before opening.,0
24,24,"Coffee beans are obviously old. No aroma, absolutely no flavor. I'm stuck with 3 more bags which are of no use to me.",0


In [18]:
df_reviews.to_csv('reviews.csv', sep='|', encoding='utf-8')
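
The file above uses `|` as the separator, presumably because review text is full of commas. The sketch below shows a pipe-delimited round trip using Python's standard `csv` module rather than pandas, with made-up rows and an in-memory buffer instead of a file on disk.

```python
import csv
import io

# Hypothetical rows in the notebook's column layout
rows = [
    {'Index': '0', 'Comment': 'Good, strong coffee', 'Positive': '1'},
    {'Index': '1', 'Comment': 'No aroma, no flavor', 'Positive': '0'},
]

# Write the rows using '|' as the field delimiter
buffer = io.StringIO(newline='')
writer = csv.DictWriter(buffer, fieldnames=['Index', 'Comment', 'Positive'],
                        delimiter='|')
writer.writeheader()
writer.writerows(rows)

# Read them back with the same delimiter
buffer.seek(0)
restored = [dict(r) for r in csv.DictReader(buffer, delimiter='|')]
print(restored[0]['Comment'])
```

Whichever tool reads the file back needs to be told the same separator, since `|` is not the default.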

#### STEP 5: TRAIN THE MODEL ON THE EXTRACTED DATA AND RATINGS

In [19]:
X = data[:,:-1]
Y = data[:,-1]

In [20]:
''' SPLIT THE REVIEWS INTO A TRAINING SET AND A TEST SET '''
#80% TRAINING AND 20% TEST

Xtrain = X[:-int(N*0.2),]
Ytrain = Y[:-int(N*0.2),]

Xtest = X[-int(N*0.2):,]
Ytest = Y[-int(N*0.2):,]
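
The split above takes the last 20% of rows in scrape order. The notebook imports `shuffle` from `sklearn.utils` but does not call it, so shuffling before the split is one variation worth trying. The sketch below uses a NumPy permutation instead of `sklearn.utils.shuffle`, on a made-up 10-row matrix; the seed is an arbitrary choice.

```python
import numpy as np

# Hypothetical data: 10 reviews, 3 features, label in the last column
N = 10
data = np.arange(N * 4, dtype=float).reshape(N, 4)

rng = np.random.default_rng(0)        # fixed seed so the split is reproducible
shuffled = data[rng.permutation(N)]   # reorder rows, keeping features and label together

X, Y = shuffled[:, :-1], shuffled[:, -1]
n_test = int(N * 0.2)                 # 20% held out for testing

Xtrain, Ytrain = X[:-n_test], Y[:-n_test]
Xtest, Ytest = X[-n_test:], Y[-n_test:]

print(Xtrain.shape, Xtest.shape)
```

Shuffling whole rows, rather than `X` and `Y` separately, keeps every review attached to its own label.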

In [21]:
''' LOGISTIC REGRESSION IS USED TO FIND THE WEIGHTS THAT PREDICT
    WHETHER A REVIEW IS NEGATIVE OR POSITIVE BASED ON THE
    COMBINATION OF WORDS IT CONTAINS    '''

model = LogisticRegression()
model.fit(Xtrain, Ytrain) # fit the weights

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [22]:
print("Accuracy on the training set:", model.score(Xtrain, Ytrain))
print("Accuracy on the test set:", model.score(Xtest, Ytest))

Accuracy on the training set: 0.8341323106423778
Accuracy on the test set: 0.7538461538461538
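
For a classifier, `model.score` returns accuracy: the fraction of reviews whose predicted label matches the true label. The sketch below computes it by hand on made-up labels, alongside a majority-class baseline (always predicting the most common label) for comparison. The `accuracy` helper is a hypothetical name.

```python
def accuracy(y_true, y_pred):
    # Fraction of positions where prediction and truth agree
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical true labels and model predictions for 10 reviews
y_true = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
y_pred = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]

# Baseline: always predict the most frequent label
majority = max(set(y_true), key=y_true.count)
baseline = [majority] * len(y_true)

print(accuracy(y_true, y_pred))    # 0.8
print(accuracy(y_true, baseline))  # 0.8
```

In this toy case the baseline matches the model, which is why an accuracy figure is easier to interpret next to the share of the most common class.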


#### STEP 6: THE MODEL CAN NOW PREDICT WHETHER FUTURE REVIEWS WILL BE NEGATIVE OR POSITIVE
We can also see several of the words that mark a comment as positive or negative.
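
As a sketch of how fitted weights turn into a prediction: logistic regression multiplies each word's weight by its frequency, adds an intercept, and passes the sum through the sigmoid function to get the probability of a positive review. The weights and intercept below are illustrative values, not the fitted model's, and `predict_proba` here is a hand-written helper, not the scikit-learn method.

```python
import math

# Hypothetical weights per word and intercept (illustrative only)
weights = {'delicious': 1.4, 'love': 1.8, 'expired': -1.1, 'date': -3.0}
intercept = 0.0


def predict_proba(word_freqs):
    # Weighted sum of word frequencies plus the intercept
    z = intercept + sum(weights.get(w, 0.0) * f for w, f in word_freqs.items())
    # The sigmoid squashes z into a probability between 0 and 1
    return 1.0 / (1.0 + math.exp(-z))


positive = predict_proba({'delicious': 0.5, 'love': 0.5})
negative = predict_proba({'expired': 0.5, 'date': 0.5})
print(round(positive, 2), round(negative, 2))
```

A probability above 0.5 is read as a positive review and below 0.5 as a negative one, so words with positive weights pull a review toward the positive class.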

In [23]:
''' THE MODEL RETURNS A NEGATIVE OR POSITIVE COEFFICIENT FOR EACH WORD '''
''' EXAMPLES OF THE MOST POSITIVE WORDS '''

threshold = 0.5
print('RATE','\t','WORD')
for word, index in iteritems(word_index_map):
    weight = model.coef_[0][index]
    if weight > threshold:
        print(round(weight,1),'\t',word)

RATE 	 WORD
0.6 	 morning
2.5 	 coffee
1.2 	 rich
0.9 	 price
0.7 	 peet
1.0 	 flavor
0.5 	 amazing
2.0 	 favorite
0.5 	 french
0.7 	 roast
0.8 	 perfect
1.8 	 love
0.7 	 smooth
0.6 	 store
0.8 	 nice
1.2 	 decaf
0.9 	 tasting
1.4 	 excellent
0.6 	 brand
0.8 	 bold
0.9 	 peets
1.4 	 delicious
0.6 	 tasty
0.5 	 expected
0.7 	 yummy


In [24]:
''' EXAMPLES OF THE MOST NEGATIVE WORDS '''

threshold = -0.5
print('RATE','\t','WORD')
for word, index in iteritems(word_index_map):
    weight = model.coef_[0][index]
    if weight < threshold:
        print(round(weight,1),'\t',word)

RATE 	 WORD
-0.6 	 roasted
-0.8 	 bean
-1.3 	 bitter
-0.8 	 stale
-3.0 	 date
-0.6 	 disappointed
-0.6 	 month
-1.6 	 received
-0.6 	 bag
-0.7 	 burnt
-0.5 	 expiration
-1.1 	 expired
-0.7 	 sorry
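
A fixed threshold is one way to pick out influential words; sorting by weight is another. In this minimal sketch, a few weights copied from the output above stand in for `model.coef_`, and the cut-off of two words per side is an arbitrary choice.

```python
# A handful of word weights (stand-ins for model.coef_[0][index])
word_weights = {'coffee': 2.5, 'favorite': 2.0, 'date': -3.0,
                'received': -1.6, 'smooth': 0.7, 'bag': -0.6}

# Sort words from most negative to most positive weight
ranked = sorted(word_weights.items(), key=lambda item: item[1])

top_negative = ranked[:2]         # two most negative words
top_positive = ranked[-2:][::-1]  # two most positive words, strongest first

print(top_negative)
print(top_positive)
```

Ranking avoids having to guess a threshold and always returns the same number of words, however large or small the coefficients are.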
