# Sentiment Analysis Using Logistic Regression

## Introduction

### About the dataset

- The dataset is the IMDB collection of movie reviews, available at: http://ai.stanford.edu/~amaas/data/sentiment

  - The dataset contains 25,000 positive reviews (label=1) and 25,000 negative reviews (label=0)
  - The dataset has only two columns: review (the review text) and sentiment (the label)


### Logistic regression model

### Importing the libraries needed for the project
---

In [None]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from nltk.stem.porter import PorterStemmer
import pickle


In [None]:
# Load the dataset
dataset_uri = "https://raw.githubusercontent.com/marcostark/Learning-Data-Science/master/desafios/datasets/imdb_movie_data.csv"

df_movies = pd.read_csv(dataset_uri)
df_movies.head()

Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Gr...",1
1,OK... so... I really like Kris Kristofferson a...,0
2,"***SPOILER*** Do not read this, if you think a...",0
3,hi for all the people who have seen this wonde...,1
4,"I recently bought the DVD, forgetting just how...",0


In [None]:
# Total number of reviews
print(df_movies.shape)

(50000, 2)


In [None]:
df_movies.columns

Index(['review', 'sentiment'], dtype='object')

In [None]:
# Dataset labels: positive (label=1), negative (label=0)
df_movies.set_index(['review', 'sentiment']).count(level='sentiment')

0
1


In [None]:
# Number of rows labeled as positive sentiment
df_movies[df_movies.sentiment==1].count()

review       25000
sentiment    25000
dtype: int64

In [None]:
# Number of rows labeled as negative sentiment
df_movies[df_movies.sentiment==0].count()

review       25000
sentiment    25000
dtype: int64

## Transforming documents into vectors

In [None]:
count = CountVectorizer()

docs = np.array(['The sun is shinnig',
                 'The weather is sweet',
                 'The sun if shinning, the weather is sweet and one and one is two'])

bag = count.fit_transform(docs)

In [None]:
print(bag.toarray())

[[0 0 1 0 1 0 1 0 1 0 0]
 [0 0 1 0 0 0 0 1 1 0 1]
 [2 1 2 2 0 1 1 1 2 1 1]]
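Each column of the matrix above stands for one word of the learned vocabulary, in alphabetical order. The sketch below reuses the three example strings verbatim, including the misspellings `shinnig`, `if`, and `shinning`, which is why the matrix has 11 columns rather than 9. The names `vectorizer` and `columns` are illustrative and not part of the original cells.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = np.array(['The sun is shinnig',
                 'The weather is sweet',
                 'The sun if shinning, the weather is sweet and one and one is two'])

vectorizer = CountVectorizer()
bag = vectorizer.fit_transform(docs)

# vocabulary_ maps each token to its column index in the count matrix;
# sorting the tokens by that index gives the column order
columns = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(columns)
# ['and', 'if', 'is', 'one', 'shinnig', 'shinning', 'sun', 'sweet', 'the', 'two', 'weather']
```

Reading the third row against this list: `and`, `is`, `one`, and `the` each occur twice, while `shinnig` (column 4) does not occur at all.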


In [None]:
term = df_movies.loc[0,'review'][-50:]
term

'is seven.<br /><br />Title (Brazil): Not Available'

## Data preparation

In [None]:
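import re

def preprocessor(text):
  # Strip HTML tags such as <br />
  text = re.sub(r'<[^>]*>', '', text)
  # Collect emoticons such as :) ;-) =D before punctuation is removed
  emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
  # Lowercase, collapse runs of non-word characters into a single space,
  # then append the emoticons (without the '-' nose) at the end
  text = re.sub(r'[\W]+', ' ', text.lower()) +\
    ' '.join(emoticons).replace('-', '')
  return text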

In [None]:
print(preprocessor(term))

is seven title brazil not available


In [None]:
print(preprocessor("</a>This ;) is a :( test :-)!"))

this is a test ;) :( :)


In [None]:
df_movies['review'] = df_movies['review'].apply(preprocessor)

In [None]:
df_movies['review']

0        in 1974 the teenager martha moxley maggie grac...
1        ok so i really like kris kristofferson and his...
2         spoiler do not read this if you think about w...
3        hi for all the people who have seen this wonde...
4        i recently bought the dvd forgetting just how ...
                               ...                        
49995    ok lets start with the best the building altho...
49996    the british heritage film industry is out of c...
49997    i don t even know where to begin on this one i...
49998    richard tyler is a little boy who is scared of...
49999    i waited long to watch this movie also because...
Name: review, Length: 50000, dtype: object

## Tokenizing the data

- Tokenization is the process of splitting a sentence into more basic units (tokens)

In [None]:
porter = PorterStemmer()

def tokenizer(text):
  return text.split()

def tokenizer_porter(text):
  return [porter.stem(word) for word in text.split()]

In [None]:
tokenizer('Luminous beings are we. Not this crude matter.')

['Luminous', 'beings', 'are', 'we.', 'Not', 'this', 'crude', 'matter.']

In [None]:
tokenizer_porter('Luminous beings are we. Not this crude matter.')

['lumin', 'be', 'are', 'we.', 'not', 'thi', 'crude', 'matter.']

In [None]:
import nltk

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
from nltk.corpus import stopwords

stop = stopwords.words('english')
[w for w in tokenizer_porter('Luminous beings are we. Not this crude matter.')[-10:] if w not in stop]

['lumin', 'we.', 'thi', 'crude', 'matter.']

In [None]:
df_movies.review.head()

0    in 1974 the teenager martha moxley maggie grac...
1    ok so i really like kris kristofferson and his...
2     spoiler do not read this if you think about w...
3    hi for all the people who have seen this wonde...
4    i recently bought the dvd forgetting just how ...
Name: review, dtype: object

## Transforming documents into TF-IDF vectors

- TF-IDF (Term Frequency - Inverse Document Frequency) is used to reduce the weight of words that appear in many documents. Such words are considered of little use for telling documents apart, so they are downweighted instead of simply being counted, as CountVectorizer does.
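To make the weighting concrete, here is a minimal sketch that recomputes TF-IDF by hand and compares it with `TfidfVectorizer`. With `smooth_idf=True` and `norm='l2'`, scikit-learn documents the inverse document frequency as `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, where `n` is the number of documents and `df(t)` the number of documents containing term `t`; each row is then scaled to unit Euclidean length. The three-sentence corpus and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (made up for illustration)
docs = ["the sun is shining",
        "the weather is sweet",
        "the sun is shining and the weather is sweet"]

counts = CountVectorizer().fit_transform(docs).toarray()  # raw term frequencies
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)                 # documents containing each term
idf = np.log((1 + n_docs) / (1 + df)) + 1     # smoothed inverse document frequency
raw = counts * idf                            # tf * idf
manual = raw / np.linalg.norm(raw, axis=1, keepdims=True)  # L2-normalize each row

reference = TfidfVectorizer(use_idf=True, norm="l2",
                            smooth_idf=True).fit_transform(docs).toarray()
print(np.allclose(manual, reference))
```

Words such as `the` and `is` occur in every document, so their IDF is the minimum value of 1 and they contribute less to each row than rarer words with the same count.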

In [None]:
tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None,
                        tokenizer = tokenizer_porter,
                        use_idf=True,
                        norm='l2',
                        smooth_idf=True)

# Note: the vectorizer is fitted on all 50,000 reviews here, before the train/test split
y = df_movies.sentiment.values
X = tfidf.fit_transform(df_movies.review)

## Classifying documents with a logistic regression model

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, test_size=0.5, shuffle=False)
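In the cells above, the vectorizer is fitted on every review before the split, so the IDF weights also reflect the test documents. One possible alternative, sketched here on a tiny made-up corpus, is to split the raw text first and wrap the vectorizer and classifier in a `Pipeline`, so that the vocabulary and IDF statistics come from the training half only. The texts, labels, and variable names are illustrative and not taken from the IMDB data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Toy corpus (made up for illustration; not taken from the IMDB dataset)
texts = ["great movie loved it", "wonderful acting and story",
         "terrible plot boring", "awful film hated it",
         "loved the wonderful story", "boring and awful acting",
         "great film great story", "hated the terrible plot"]
labels = [1, 1, 0, 0, 1, 0, 1, 0]

# Split the raw text first, so the test reviews stay unseen by the vectorizer
text_train, text_test, label_train, label_test = train_test_split(
    texts, labels, test_size=0.25, random_state=1, stratify=labels)

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(random_state=0)),
])
pipe.fit(text_train, label_train)  # vocabulary and IDF come from text_train only

print(pipe.predict(text_test))
```

A fitted pipeline can also be pickled as a single object, which keeps the vectorizer and the model together.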

In [None]:
from sklearn.linear_model import LogisticRegressionCV

clf = LogisticRegressionCV(
    cv=5,
    scoring='accuracy',
    random_state=0,
    n_jobs=-1,
    verbose=3,
    max_iter=300)

clf.fit(X_train, y_train)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:  2.5min finished


LogisticRegressionCV(Cs=10, class_weight=None, cv=5, dual=False,
                     fit_intercept=True, intercept_scaling=1.0, l1_ratios=None,
                     max_iter=300, multi_class='auto', n_jobs=-1, penalty='l2',
                     random_state=0, refit=True, scoring='accuracy',
                     solver='lbfgs', tol=0.0001, verbose=3)
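`LogisticRegressionCV` selects the regularization strength `C` by cross-validation. With `Cs=10`, as in the printed estimator above, scikit-learn builds a grid of ten values on a logarithmic scale between 1e-4 and 1e4 and, because `refit=True`, refits on the best one. The sketch below shows how the grid and the selected value can be inspected after fitting. It uses synthetic data from `make_classification` as a stand-in for the TF-IDF matrix, and the names `X_demo`, `y_demo`, and `demo_clf` are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Synthetic binary classification data (a stand-in for the TF-IDF features)
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)

demo_clf = LogisticRegressionCV(cv=5, scoring='accuracy',
                                random_state=0, max_iter=300)
demo_clf.fit(X_demo, y_demo)

print(demo_clf.Cs_)           # candidate values of C that were tried
print(np.ravel(demo_clf.C_))  # value of C selected by cross-validation
```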

### The vectorizer learns its vocabulary during fitting; it is saved here so that it can be applied to new sentences.

In [None]:
# Save the vectorizer to a file
with open('vectorizer.pkl', 'wb') as f:
    pickle.dump(tfidf, f)

## Saving the model to a file

In [None]:
# Save the model to a file
with open('sentiment_analysis_model.pkl', 'wb') as f:
    pickle.dump(clf, f)

In [None]:
print('Test accuracy: {:.3f}'.format(clf.score(X_test, y_test)))

Test accuracy: 0.896


## Loading the model and vectorizer files for use in new predictions

In [None]:
file_model = 'sentiment_analysis_model.pkl'
file_vectorizer = 'vectorizer.pkl'

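# Unpickling the vectorizer requires tokenizer_porter to be defined in this session,
# because pickle stores the custom tokenizer by reference rather than by value
with open(file_vectorizer, 'rb') as f:
    vectorizer = pickle.load(f)

with open(file_model, 'rb') as f:
    model = pickle.load(f)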

In [None]:
# user_input = "I think I'm a good developer with really good understanding of .NET"
user_input = "I didn't like this movie, it sucks"

review = vectorizer.transform([str(user_input)])
prediction = model.predict(review)
result = 'Negative' if prediction[0] == 0 else 'Positive'
output = {'Prediction': result}
output

{'Prediction': 'Negative'}
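`predict` returns only the class label. When it helps to know how confident the model is, `predict_proba` returns the estimated probability of each class. The sketch below trains a small stand-in model on a made-up corpus so that it runs on its own. The helper `classify`, the corpus, and the `demo_*` names are hypothetical and not part of the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training data (made up for illustration; not the IMDB reviews)
train_texts = ["great movie loved it", "wonderful acting and story",
               "terrible plot boring", "awful film hated it",
               "loved the wonderful story", "boring and awful acting"]
train_labels = [1, 1, 0, 0, 1, 0]

demo_vectorizer = TfidfVectorizer()
demo_model = LogisticRegression(random_state=0)
demo_model.fit(demo_vectorizer.fit_transform(train_texts), train_labels)

def classify(text):
    """Hypothetical helper: return the label and the probability of each class."""
    features = demo_vectorizer.transform([text])
    proba = demo_model.predict_proba(features)[0]  # ordered like demo_model.classes_
    label = 'Positive' if demo_model.predict(features)[0] == 1 else 'Negative'
    return label, dict(zip(demo_model.classes_.tolist(), proba.tolist()))

print(classify("boring and terrible film"))
```

The same pattern applies to the unpickled `vectorizer` and `model`, since `LogisticRegressionCV` also provides `predict_proba`.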