In [36]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import json

# SKLEARN (lib)

### Machine Learning Process
- What question are we trying to answer?
- Find data to help answer the question
- Process the data
- Build the model
- Evaluate the model
- Improve the model further

In [37]:
import json

file_name = './dataframes/SKLEARN/Books_small_10000.json'  # path to the JSON file

with open(file_name) as f:
    for line in f:
        print(line)  # print the first line of the JSON file
        break  # stop the loop after the first line

{"reviewerID": "A1F2H80A1ZNN1N", "asin": "B00GDM3NQC", "reviewerName": "Connie Correll", "helpful": [0, 0], "reviewText": "I bought both boxed sets, books 1-5.  Really a great series!  Start book 1 three weeks ago and just finished book 5.  Sloane Monroe is a great character and being able to follow her through both private life and her PI life gets a reader very involved!  Although clues may be right in front of the reader, there are twists and turns that keep one guessing until the last page!  These are books you won't be disappointed with.", "overall": 5.0, "summary": "Can't stop reading!", "unixReviewTime": 1390435200, "reviewTime": "01 23, 2014"}



In [38]:
file_name = './dataframes/SKLEARN/Books_small_10000.json'  # path to the JSON file


with open(file_name) as f:
    for line in f:
        review = json.loads(line)  # parse the JSON string into a Python dict
        print(review['reviewText'])  # print only the review text
        print(review['overall'])  # print the rating

        break

I bought both boxed sets, books 1-5.  Really a great series!  Start book 1 three weeks ago and just finished book 5.  Sloane Monroe is a great character and being able to follow her through both private life and her PI life gets a reader very involved!  Although clues may be right in front of the reader, there are twists and turns that keep one guessing until the last page!  These are books you won't be disappointed with.
5.0
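
The file holds one JSON object per line, so each line can be parsed on its own with `json.loads`. A minimal sketch of that pattern, using an in-memory sample instead of the real file; the helper name `read_reviews` and the two sample records are made up for illustration:

```python
import io
import json

def read_reviews(lines):
    """Yield (text, rating) pairs from an iterable of JSON Lines strings."""
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        record = json.loads(line)
        yield record["reviewText"], record["overall"]

# Hypothetical two-record sample standing in for the real file.
sample = io.StringIO(
    '{"reviewText": "Great series!", "overall": 5.0}\n'
    '{"reviewText": "Not for me.", "overall": 2.0}\n'
)
pairs = list(read_reviews(sample))
print(pairs)
```

Passing an open file object instead of `sample` should work the same way, since both yield one line at a time.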


### Creating a class

In [53]:
import random

class Sentiment:
    NEGATIVE = "NEGATIVE"
    NEUTRAL = "NEUTRAL"
    POSITIVE = "POSITIVE"
    

class Review:
    def __init__(self, text, score):
        self.text = text  # review text
        self.score = score  # star rating
        self.sentiment = self.get_sentiment()  # sentiment label derived from the score

    def get_sentiment(self):
        if self.score <= 2:
            # could also return the string "NEGATIVE" directly, but we use the Sentiment class instead
            return Sentiment.NEGATIVE
        elif self.score == 3:
            return Sentiment.NEUTRAL
        else:  # score of 4 or 5
            return Sentiment.POSITIVE
        
# container class used to balance the number of positive and negative reviews
class ReviewContainer:
    def __init__(self, reviews):
        self.reviews = reviews

    def get_text(self):
        return [x.text for x in self.reviews]

    def get_sentiment(self):
        return [x.sentiment for x in self.reviews]

    def evenly_distribute(self):  # balance the number of positive and negative examples
        negative = list(filter(lambda x: x.sentiment == Sentiment.NEGATIVE, self.reviews))  # all negative reviews
        positive = list(filter(lambda x: x.sentiment == Sentiment.POSITIVE, self.reviews))  # all positive reviews
        positive_shrunk = positive[:len(negative)]  # trim the positives to match the number of negatives
        self.reviews = negative + positive_shrunk  # combine the negatives with the trimmed positives (neutral reviews are dropped)
        random.shuffle(self.reviews)  # shuffle the combined list
    
        
sentimento = Review("a fruta é muito boa", 2)
print(sentimento.sentiment)
print(sentimento.text)
print(sentimento.score)

NEGATIVE
a fruta é muito boa
2
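
Before balancing anything, it helps to see how the ratings spread across the three labels. A small sketch that applies the same thresholds as `Review.get_sentiment` and tallies the result with `collections.Counter`; the function `score_to_sentiment` and the list of ratings are hypothetical:

```python
from collections import Counter

def score_to_sentiment(score):
    """Same thresholds as Review.get_sentiment above."""
    if score <= 2:
        return "NEGATIVE"
    if score == 3:
        return "NEUTRAL"
    return "POSITIVE"

# Hypothetical ratings, chosen only to illustrate an imbalanced set.
scores = [5.0, 4.0, 5.0, 3.0, 1.0, 5.0, 4.0, 2.0]
counts = Counter(score_to_sentiment(s) for s in scores)
print(counts)
```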


In [54]:
file_name = './dataframes/SKLEARN/Books_small_10000.json'  # path to the JSON file

reviews = []  # empty list that the loop fills with Review objects
with open(file_name) as f:
    for line in f:
        review = json.loads(line)  # parse the JSON string into a Python dict
        reviews.append(Review(review['reviewText'], review['overall']))  # Review takes the text first and the score second

# reviews[0:2]  # shows items 0 and 1
print(reviews[5].sentiment)  # sentiment of the review at index 5 (use .text to show the review text instead)
print(reviews[5].score)
print(reviews[5].text)

POSITIVE
5.0
I hoped for Mia to have some peace in this book, but her story is so real and raw.  Broken World was so touching and emotional because you go from Mia's trauma to her trying to cope.  I love the way the story displays how there is no "just bouncing back" from being sexually assaulted.  Mia showed us how those demons come for you every day and how sometimes they best you. I was so in the moment with Broken World and hurt with Mia because she was surrounded by people but so alone and I understood her feelings.  I found myself wishing I could give her some of my courage and strength or even just to be there for her.  Thank you Lizzy for putting a great character's voice on a strong subject and making it so that other peoples story may be heard through Mia's.


### Converting text to numerical vectors (Prep Data)

In [69]:
# sklearn train test split

# split the data into training and test sets
from sklearn.model_selection import train_test_split

training, test = train_test_split(reviews, test_size=0.33, random_state=42)  # tip: Shift+Tab shows the function's parameters
# test_size=0.33: 33% of the reviews go to the test set and the remaining 67% to the training set

train_container = ReviewContainer(training)  # wrap the training set in a ReviewContainer

test_container = ReviewContainer(test)  # wrap the test set in a ReviewContainer



None
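
Conceptually, the split shuffles the data with a fixed seed and cuts it in two. The sketch below is a hand-rolled stand-in that shows the idea; it is not scikit-learn's implementation, and `simple_split` is a hypothetical helper:

```python
import random

def simple_split(items, test_size=0.33, seed=42):
    """Shuffle a copy of items and cut it into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

train, test = simple_split(range(100))
print(len(train), len(test))  # 67 33
```

A fixed seed plays the same role as `random_state=42` in the cell above: rerunning the cell gives the same split.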


In [76]:
train_container.evenly_distribute()  # balance the classes; evenly_distribute() is a ReviewContainer method
train_x = train_container.get_text()
train_y = train_container.get_sentiment()

test_container.evenly_distribute()
test_x = test_container.get_text()
test_y = test_container.get_sentiment()

print(train_y.count(Sentiment.POSITIVE))
print(train_y.count(Sentiment.NEGATIVE))




436
436
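
`evenly_distribute()` trims the positive list to the number of negatives, which balances the classes only when there are at least as many positives as negatives. A sketch of a symmetric variant that trims both classes to the smaller count; the function `balance` and the toy data are hypothetical:

```python
import random

def balance(labelled, seed=0):
    """Undersample so POSITIVE and NEGATIVE end up with equal counts."""
    negative = [x for x in labelled if x[0] == "NEGATIVE"]
    positive = [x for x in labelled if x[0] == "POSITIVE"]
    n = min(len(negative), len(positive))  # size of the smaller class
    balanced = negative[:n] + positive[:n]  # other labels are dropped
    random.Random(seed).shuffle(balanced)
    return balanced

# Toy data with more negatives than positives.
data = [("POSITIVE", "loved it")] * 2 + [("NEGATIVE", "awful")] * 5 + [("NEUTRAL", "ok")]
balanced = balance(data)
labels = [label for label, _ in balanced]
print(labels.count("POSITIVE"), labels.count("NEGATIVE"))  # 2 2
```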


### Bag of words vectorization (extracting features from text)

In [96]:
from sklearn.feature_extraction.text import CountVectorizer,  TfidfVectorizer

# TfidfVectorizer gives words that appear in many reviews a lower weight than words that appear only occasionally
# e.g. "this book is great", "this book was so bad" --> "this" appears in both, so its weight is lower

vectorizer = TfidfVectorizer()  # rarer words get a higher weight in the vectors
# vectorizer = CountVectorizer()
train_x_vectors = vectorizer.fit_transform(train_x)

test_x_vectors = vectorizer.transform(test_x)

print(train_x[0])  # review text
print(train_x_vectors[0].toarray())  # the same review as a vector


The description of Detroit is an eye opener for this Michigander who hasn't been there for awhile.  Really sad.  The story did some flip flopping but was easily followed.  While set mostly in the L.P. (I read Alex McKnight because the stories are usually set in the U.P.), the reader gets to know his back story with this novel.  A really good read.
[[0. 0. 0. ... 0. 0. 0.]]
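
To see why shared words end up with lower weights, the TF-IDF values for the two example sentences can be worked out by hand. This sketch assumes scikit-learn's default settings (smoothed idf, L2 normalisation) and uses a plain whitespace split, which is a simplification of the library's tokenizer:

```python
import math
from collections import Counter

docs = ["this book is great", "this book was so bad"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each word occurs.
df = Counter(word for tokens in tokenized for word in set(tokens))

def tfidf(tokens):
    """TF-IDF with smoothed idf and L2 normalisation."""
    tf = Counter(tokens)
    raw = {w: tf[w] * (math.log((1 + n_docs) / (1 + df[w])) + 1) for w in tf}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {w: v / norm for w, v in raw.items()}

weights = tfidf(tokenized[0])
for word, weight in sorted(weights.items()):
    print(f"{word}: {weight:.3f}")
```

In the first sentence, "this" and "book" occur in both documents and receive a lower weight than "is" and "great", which occur in only one.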


### Classification

#### Linear SVM

In [97]:
# Support Vector Machine
from sklearn import svm

clf_svm = svm.SVC(kernel='linear')  # kernel='linear' selects the SVM kernel function, here a linear decision boundary

clf_svm.fit(train_x_vectors, train_y)  # train the SVM classifier on the training vectors

test_x[0]  # text of the review at index 0 of the test set
# train_x_vectors[0]
clf_svm.predict(train_x_vectors[0])  # predicted sentiment for the first training example

array(['POSITIVE'], dtype='<U8')
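
The vectorizer and the classifier can also be chained into a single object, so raw text goes in and labels come out. A sketch using `make_pipeline` from `sklearn.pipeline`; the four training sentences are made up and far too few for a real model:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy data, for illustration only.
texts = ["great book, loved it", "amazing story", "terrible waste of time", "boring and bad"]
labels = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]

# The pipeline fits the vectorizer and then the classifier in one call.
model = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))
model.fit(texts, labels)

# predict() vectorizes the new text with the fitted vocabulary first.
predictions = model.predict(["loved this amazing book", "bad and boring"])
print(predictions)
```

With this arrangement there is no separate `transform` step to forget when predicting on new text.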

#### Decision Tree

In [127]:
from sklearn.tree import DecisionTreeClassifier

clf_dec = DecisionTreeClassifier()
clf_dec.fit(train_x_vectors, train_y)
clf_dec.predict(train_x_vectors[0])

array(['POSITIVE'], dtype='<U8')

#### Naive Bayes

In [146]:
from sklearn.naive_bayes import GaussianNB ## *******

clf_gnb = GaussianNB()
clf_gnb.fit(train_x_vectors.toarray(), train_y)
#clf_gnb.predict(train_x_vectors.toarray())

GaussianNB()

#### Logistic Regression

In [100]:
from sklearn.linear_model import LogisticRegression

clf_log = LogisticRegression()  # class weights can be set here, e.g. LogisticRegression(class_weight="balanced")
clf_log.fit(train_x_vectors, train_y)

clf_log.predict(test_x_vectors[0])

array(['POSITIVE'], dtype='<U8')

### Evaluation

In [145]:
# Mean accuracy: the fraction of test examples whose predicted label matches the true label.
print(clf_svm.score(test_x_vectors, test_y))  # score() returns the mean accuracy of each classifier
print(clf_dec.score(test_x_vectors, test_y))  # pass the test set here!
print(clf_gnb.score(test_x_vectors.toarray(), test_y))
print(clf_log.score(test_x_vectors, test_y))

0.8076923076923077
0.6634615384615384
0.6610576923076923
0.8052884615384616
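
Mean accuracy is simple enough to compute by hand, which makes the numbers above easier to interpret. A sketch with hypothetical labels; `accuracy` is an illustrative helper, not a library function:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical labels for illustration.
y_true = ["POSITIVE", "NEGATIVE", "NEGATIVE", "POSITIVE", "POSITIVE"]
y_pred = ["POSITIVE", "NEGATIVE", "POSITIVE", "POSITIVE", "NEGATIVE"]
print(accuracy(y_true, y_pred))  # 0.6
```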


In [84]:
# F1 score: the harmonic mean of precision and recall.
from sklearn.metrics import f1_score

print(f1_score(test_y, clf_svm.predict(test_x_vectors), average=None, labels=[Sentiment.POSITIVE, Sentiment.NEGATIVE]))
print(f1_score(test_y, clf_log.predict(test_x_vectors), average=None, labels=[Sentiment.POSITIVE, Sentiment.NEUTRAL, Sentiment.NEGATIVE]))

# first argument: the true labels
# second argument: the labels predicted by the model
# labels selects which classes get an F1 score and in what order; the values are per-class F1 scores, not class proportions
# NEUTRAL scores 0 because evenly_distribute() keeps only POSITIVE and NEGATIVE reviews, so no NEUTRAL samples remain (hence the warning)

# on imbalanced data a model can predict POSITIVE well yet do poorly on NEUTRAL and NEGATIVE; with the balanced sets the POSITIVE and NEGATIVE scores come out close to each other

# OTHER MODELS:
print(f1_score(test_y, clf_dec.predict(test_x_vectors), average=None, labels=[Sentiment.POSITIVE, Sentiment.NEUTRAL, Sentiment.NEGATIVE]))



[0.8028169  0.79310345]
[0.82051282 0.         0.808933  ]
[0.65411765 0.         0.63882064]


  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))
  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))
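
The per-class F1 values, including the 0 for a label with no samples, can be reproduced by counting true positives, false positives and false negatives for each label. A sketch with hypothetical labels; `f1_per_class` is an illustrative helper, not the library function:

```python
def f1_per_class(y_true, y_pred, labels):
    """Return one F1 score per label, in the order given."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # No true or predicted samples for this label: report 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return scores

y_true = ["POSITIVE", "NEGATIVE", "NEGATIVE", "POSITIVE", "POSITIVE"]
y_pred = ["POSITIVE", "NEGATIVE", "POSITIVE", "POSITIVE", "NEGATIVE"]
scores = f1_per_class(y_true, y_pred, ["POSITIVE", "NEUTRAL", "NEGATIVE"])
print(scores)
```

Here NEUTRAL gets 0.0 because it appears in neither list, the same situation that triggers the warning in the cell above.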


In [102]:
print(train_y.count(Sentiment.POSITIVE))  # count the POSITIVE labels in the training set
print(train_y.count(Sentiment.NEGATIVE))
print(train_y.count(Sentiment.NEUTRAL))

print(test_y.count(Sentiment.POSITIVE)) 
print(test_y.count(Sentiment.NEGATIVE))
print(test_y.count(Sentiment.NEUTRAL))

436
436
0
208
208
0


In [103]:
test_set = ['very fun', "bad book do not buy", 'horrible waste of time', ' this book is nice', 'amazing book']
new_test = vectorizer.transform(test_set)

clf_svm.predict(new_test)

array(['POSITIVE', 'NEGATIVE', 'NEGATIVE', 'POSITIVE', 'POSITIVE'],
      dtype='<U8')

### Tuning our model (with Grid Search)
- GridSearchCV to automatically find the best parameters

In [114]:
from sklearn.model_selection import GridSearchCV  # search for the best parameters for this dataset

parameters = {'kernel': ('linear', 'rbf'), 'C': (1,4,8,16,32)}

svc = svm.SVC()
clf = GridSearchCV(svc, parameters, cv=5)  # first argument: the estimator to tune; second: the parameter grid

clf.fit(train_x_vectors, train_y)

# clf.best_params_ holds the best parameter combination found
# candidates are ranked by their cross-validated score


GridSearchCV(cv=5, estimator=SVC(),
             param_grid={'C': (1, 4, 8, 16, 32), 'kernel': ('linear', 'rbf')})
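
A grid search tries every combination of the listed values, so the cost grows quickly with each added parameter. A sketch that enumerates the same grid with `itertools.product` to show how many models get trained; the variable names are illustrative:

```python
from itertools import product

parameters = {"kernel": ("linear", "rbf"), "C": (1, 4, 8, 16, 32)}
cv = 5

# Every combination of the listed values is one candidate model.
names = list(parameters)
candidates = [dict(zip(names, values)) for values in product(*parameters.values())]

print(len(candidates))       # 10 candidates
print(len(candidates) * cv)  # 50 fits during cross-validation
print(candidates[0])
```

With `refit` left at its default, `GridSearchCV` also trains the best candidate once more on the full training set.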

In [115]:
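print(clf_svm.score(test_x_vectors, test_y))  # note: this scores the original clf_svm; use clf.score(test_x_vectors, test_y) to evaluate the tuned model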

0.8076923076923077


### Saving Model

#### Save model


In [116]:
import pickle

with open('./dataframes/SKLEARN/models/sentiment_classifier.pkl', 'wb') as f:
    pickle.dump(clf, f)  # serialize the fitted classifier and its parameters to the file opened above

#### Load model

In [117]:
with open('./dataframes/SKLEARN/models/sentiment_classifier.pkl', 'rb') as f:
    loaded_clf = pickle.load(f)

In [157]:
print(test_x[400])

loaded_clf.predict(test_x_vectors[400])

This book is full of inspiring ideas to use in enjoying the outdoors to a fuller and more creative extent.


array(['POSITIVE'], dtype='<U8')
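
The pickled classifier expects input that was already transformed by the same fitted vectorizer, so a fresh session needs that vectorizer too. One option is to pickle both objects together in a dictionary. The sketch below shows the round trip with plain Python stand-ins; the dictionary keys and stand-in values are hypothetical, and in the notebook the values would be `vectorizer` and `clf`:

```python
import os
import pickle
import tempfile

# Stand-ins for the fitted vectorizer and classifier.
bundle = {
    "vectorizer": {"vocabulary": {"bad": 0, "great": 1}},
    "classifier": {"weights": [-1.0, 1.0]},
}

# Write the bundle to a temporary file, then read it back.
path = os.path.join(tempfile.mkdtemp(), "sentiment_bundle.pkl")
with open(path, "wb") as f:
    pickle.dump(bundle, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == bundle)  # True
```

Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.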