# Sentiment detection on Twitter

- Data related to Minas Gerais/MG
- The dataset contains some tweets in Spanish
- The tweets involve politics, which may bias the analysis
- The text contains accented vowels

In [1]:
import pandas as pd
from tqdm import tqdm

In [2]:
data = pd.read_csv('tweets_mg.csv', index_col=0)
data = data.dropna(axis=1)
print(data.shape)
data.head()

(8199, 6)


Unnamed: 0,Created At,Text,Username,User Screen Name,Retweet Count,Classificacao
0,Sun Jan 08 01:22:05 +0000 2017,���⛪ @ Catedral de Santo Antônio - Governador ...,Leonardo C Schneider,LeoCSchneider,0,Neutro
1,Sun Jan 08 01:49:01 +0000 2017,"� @ Governador Valadares, Minas Gerais https:/...",Wândell,klefnews,0,Neutro
2,Sun Jan 08 01:01:46 +0000 2017,"�� @ Governador Valadares, Minas Gerais https:...",Wândell,klefnews,0,Neutro
3,Wed Jan 04 21:43:51 +0000 2017,��� https://t.co/BnDsO34qK0,Ana estudando,estudandoconcur,0,Neutro
4,Mon Jan 09 15:08:21 +0000 2017,��� PSOL vai questionar aumento de vereadores ...,Emily,Milly777,0,Negativo


In [3]:
data.nunique()

Created At          7945
Text                5765
Username            3907
User Screen Name    3966
Retweet Count        113
Classificacao          3
dtype: int64

## Tweet preprocessing

Many tweets contain characters that get in the way of the analysis. This section addresses that by filtering unnecessary characters out of the text.

In [4]:
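import re

import numpy as np


class Filtro:
    """Cleans raw tweet text; stateless, so fit() has nothing to learn."""

    def fit(self, x, y=None):
        # Return self (not x) so the object follows the scikit-learn transformer protocol
        return self

    def transform(self, x):
        # Apply the text filter element-wise; np.vectorize returns a new array
        return np.vectorize(self.filtro_texto)(x)

    @staticmethod
    def filtro_texto(text):
        # Remove mentions
        text = re.sub(r'@\w+', '', text)

        # Remove URLs
        text = re.sub(r'https?://\S+\s?', '', text)

        # Remove everything that is not an ASCII letter or whitespace
        # (this also drops accented letters, e.g. "Antônio" -> "Antnio")
        text = re.sub(r'[^a-zA-Z\s]', '', text)

        # Remove runs of two or more whitespace characters
        # (replaced with '', so the words around the run are joined)
        text = re.sub(r'\s{2,}', '', text)
        text = text.strip()

        # Lowercase the text
        return text.lower()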



In [5]:
f = Filtro()

In [6]:
data['Text'] = f.transform(data['Text'].values)

In [7]:
data.head()

Unnamed: 0,Created At,Text,Username,User Screen Name,Retweet Count,Classificacao
0,Sun Jan 08 01:22:05 +0000 2017,catedral de santo antniogovernador valadaresmg,Leonardo C Schneider,LeoCSchneider,0,Neutro
1,Sun Jan 08 01:49:01 +0000 2017,governador valadares minas gerais,Wândell,klefnews,0,Neutro
2,Sun Jan 08 01:01:46 +0000 2017,governador valadares minas gerais,Wândell,klefnews,0,Neutro
3,Wed Jan 04 21:43:51 +0000 2017,,Ana estudando,estudandoconcur,0,Neutro
4,Mon Jan 09 15:08:21 +0000 2017,psol vai questionar aumento de vereadores e pr...,Emily,Milly777,0,Negativo
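
The first row above shows two side effects of the filter: the accented letter is lost ("antnio") and two words are joined ("antniogovernador"). A minimal sketch of a variant that avoids both, assuming accented Latin letters should be kept and every removed character should become a space. `clean_tweet` and `NOT_LETTER` are hypothetical names, not part of the notebook, and the character ranges are an assumption about which letters matter for Portuguese and Spanish.

```python
import re

# Letters in the Latin-1 accented ranges cover Portuguese and Spanish vowels, ç and ñ;
# the range endpoints are an assumption about which characters to keep.
NOT_LETTER = re.compile(r"[^a-zA-ZÀ-ÖØ-öø-ÿ\s]")


def clean_tweet(text: str) -> str:
    """Hypothetical variant of filtro_texto that keeps accents and word boundaries."""
    text = re.sub(r"@\w+", " ", text)          # mentions
    text = re.sub(r"https?://\S+", " ", text)  # URLs
    text = NOT_LETTER.sub(" ", text)           # digits, punctuation, emoji
    text = re.sub(r"\s+", " ", text)           # collapse whitespace to one space
    return text.strip().lower()


print(clean_tweet("⛪ @ Catedral de Santo Antônio - Governador Valadares/MG https://t.co/abc"))
```

Replacing removed characters with a space instead of an empty string is what keeps "Valadares/MG" from collapsing into a single token. Swapping this in would change the vocabulary and therefore the results reported later in the notebook.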


## Tokenization and stemming

In [8]:
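class Processamento:
    """Applies the RSLP (Portuguese) stemmer to each text."""

    def __init__(self):
        from nltk.stem import RSLPStemmer

        # Stop words are loaded here but not yet used by stemming()
        stop_words = pd.read_csv('stopwords_pt.txt', names=['words']).iloc[:, 0]
        self.stop_words = stop_words.apply(lambda x: x.rstrip())

        self.st = RSLPStemmer()

    def fit(self, x, y=None):
        # Stateless step: return self to follow the scikit-learn transformer protocol
        return self

    def transform(self, x):
        import numpy as np

        return np.vectorize(self.stemming)(x)

    def stemming(self, text):
        # The whole text is passed to the stemmer as a single string, so the suffix
        # rules act only on the end of the text (e.g. "minas gerais" -> "minas ger").
        # Earlier per-token attempts, kept for reference:
        #     doc = nlp(text)
        #     tokenized = [token.text for token in doc if token.text not in stop_words.values]
        #     token = text.split(' ')
        #     token = [st.stem(tk) for tk in token if (tk not in stop_words.values) and (len(tk) > 0)]

        # Empty texts are mapped to None
        return self.st.stem(text) if len(text) > 0 else None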


In [9]:
proc = Processamento()

In [10]:
data['token'] = proc.transform(data['Text'].values)

In [11]:
data.head()

Unnamed: 0,Created At,Text,Username,User Screen Name,Retweet Count,Classificacao,token
0,Sun Jan 08 01:22:05 +0000 2017,catedral de santo antniogovernador valadaresmg,Leonardo C Schneider,LeoCSchneider,0,Neutro,catedral de santo antniogovernador valadaresmg
1,Sun Jan 08 01:49:01 +0000 2017,governador valadares minas gerais,Wândell,klefnews,0,Neutro,governador valadares minas ger
2,Sun Jan 08 01:01:46 +0000 2017,governador valadares minas gerais,Wândell,klefnews,0,Neutro,governador valadares minas ger
3,Wed Jan 04 21:43:51 +0000 2017,,Ana estudando,estudandoconcur,0,Neutro,
4,Mon Jan 09 15:08:21 +0000 2017,psol vai questionar aumento de vereadores e pr...,Emily,Milly777,0,Negativo,psol vai questionar aumento de vereadores e pr...
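
In the output above only the last word of each text changes ("gerais" becomes "ger"), because the stemmer receives the whole text as one string. A minimal sketch of per-token stemming with stop-word removal follows. `stem_tokens` is a hypothetical helper, and `toy_stem` is a stand-in that only strips a trailing "s" so the example runs without NLTK data. In the notebook one would presumably pass `proc.st.stem` and `proc.stop_words` instead.

```python
from typing import Callable, Iterable


def stem_tokens(text: str, stem: Callable[[str], str], stop_words: Iterable[str] = ()) -> str:
    """Split on whitespace, drop stop words, stem each remaining token."""
    stop = set(stop_words)
    tokens = [stem(tok) for tok in text.split() if tok not in stop]
    return " ".join(tokens)


def toy_stem(word: str) -> str:
    """Stand-in stemmer (NOT RSLP): strips a trailing 's' from longer words."""
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


print(stem_tokens("aumento de vereadores e prefeitos", toy_stem, stop_words={"de", "e"}))
```

Unlike the notebook's `stemming` method, this sketch returns an empty string for an empty text rather than `None`, which keeps the column a plain string column.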


## Feature extraction

For this first model, we use a simple bag of words.
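
As a rough illustration of what a bag of words is, the sketch below builds one by hand on a toy corpus using only the standard library. It is a simplified picture, not how `CountVectorizer` is implemented (for example, it does no tokenization beyond splitting on whitespace), and the `min_df` value is lowered only because the corpus is tiny.

```python
from collections import Counter

# Toy corpus, for illustration only
docs = [
    "governador valadares minas ger",
    "governador valadares minas ger",
    "psol vai questionar aumento",
]

# Document frequency: in how many documents each word appears
doc_freq = Counter(word for doc in docs for word in set(doc.split()))

# Keep words seen in at least `min_df` documents
min_df = 2
vocabulary = sorted(word for word, df in doc_freq.items() if df >= min_df)

# One row of counts per document, one column per vocabulary word
matrix = [[Counter(doc.split())[word] for word in vocabulary] for doc in docs]

print(vocabulary)
print(matrix)
```

The third document ends up as a row of zeros because none of its words pass the `min_df` threshold. The same can happen to real tweets after rare words are filtered out.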

In [12]:
data = data.dropna(subset=['token'])
data = data[['token', 'Classificacao']]
print(data.shape)
data.head()

(8199, 2)


Unnamed: 0,token,Classificacao
0,catedral de santo antniogovernador valadaresmg,Neutro
1,governador valadares minas ger,Neutro
2,governador valadares minas ger,Neutro
3,,Neutro
4,psol vai questionar aumento de vereadores e pr...,Negativo


In [13]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=5, max_df=.75)


In [14]:
X = vectorizer.fit_transform(data['token'])
y = data['Classificacao']

print(X.shape)
print(y.shape)

(8199, 1739)
(8199,)


In [15]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=12)

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(6149, 1739) (6149,)
(2050, 1739) (2050,)


In [16]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# clf = LogisticRegression(C=0.5)
clf = RandomForestClassifier(max_depth=10, min_samples_leaf=10)

clf.fit(X_train, y_train)

RandomForestClassifier(max_depth=10, min_samples_leaf=10)

In [17]:
from sklearn.metrics import classification_report

print('Results for the training set\n\n')

print(classification_report(y_train, clf.predict(X_train)))


print('Results for the test set\n\n')

print(classification_report(y_test, clf.predict(X_test)))

Results for the training set


              precision    recall  f1-score   support

    Negativo       0.98      0.79      0.88      1842
      Neutro       0.81      0.72      0.76      1840
    Positivo       0.78      0.96      0.86      2467

    accuracy                           0.84      6149
   macro avg       0.86      0.82      0.83      6149
weighted avg       0.85      0.84      0.84      6149

Results for the test set


              precision    recall  f1-score   support

    Negativo       0.98      0.77      0.86       604
      Neutro       0.80      0.73      0.76       613
    Positivo       0.78      0.96      0.86       833

    accuracy                           0.83      2050
   macro avg       0.86      0.82      0.83      2050
weighted avg       0.85      0.83      0.83      2050
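
To make the report easier to read, the sketch below recomputes the f1 column of the test-set report from its precision, recall and support columns. The inputs are the rounded values printed above, so in general the last digit could differ from what scikit-learn computes from the raw predictions.

```python
# (precision, recall, support) per class, copied from the test-set report above
report = {
    "Negativo": (0.98, 0.77, 604),
    "Neutro": (0.80, 0.73, 613),
    "Positivo": (0.78, 0.96, 833),
}


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


scores = {label: f1(p, r) for label, (p, r, _) in report.items()}
total = sum(n for _, _, n in report.values())

# Macro average: plain mean over classes; weighted average: mean weighted by support
macro = sum(scores.values()) / len(scores)
weighted = sum(scores[label] * n for label, (_, _, n) in report.items()) / total

for label, score in scores.items():
    print(f"{label}: {score:.2f}")
print(f"macro avg: {macro:.2f}")
print(f"weighted avg: {weighted:.2f}")
```

The high precision and lower recall for Negativo, mirrored by the opposite pattern for Positivo, suggest the model tends to label uncertain tweets as Positivo.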



In [18]:
df_features = pd.DataFrame(columns=['feature', 'importance'])
# get_feature_names() was removed in recent scikit-learn releases; use get_feature_names_out()
df_features['feature'] = vectorizer.get_feature_names_out()
df_features['importance'] = clf.feature_importances_
df_features = df_features.sort_values('importance', ascending=False)
df_features.head()

Unnamed: 0,feature,importance
701,governo,0.069469
295,compra,0.05966
203,calamidade,0.056612
468,drogas,0.043232
631,financeira,0.042139


## Building the pipeline

In [19]:
from sklearn.pipeline import Pipeline

In [20]:
p = Pipeline([('Filtro', f), ('Processamento', proc), ('Vetorizacao', vectorizer), ('Modelo', clf)])
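
The pipeline above is assembled from objects that were already fitted, so only `predict` is exercised. For `Pipeline.fit` to work end to end, each intermediate step needs a `fit(X, y=None)` that returns `self` and a `transform` method. A minimal sketch of that contract, using an invented toy dataset and a hypothetical `Lowercase` step (neither comes from the notebook):

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class Lowercase(BaseEstimator, TransformerMixin):
    """Hypothetical stateless step: fit() learns nothing and returns self."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [text.lower() for text in X]


# Toy data, invented for illustration only
texts = ["Ótimo governo", "Péssimo governo", "Ótimo dia", "Péssimo dia"]
labels = ["Positivo", "Negativo", "Positivo", "Negativo"]

pipe = Pipeline([
    ("lower", Lowercase()),
    ("bow", CountVectorizer()),
    ("model", LogisticRegression()),
])

# fit() runs fit/transform on each step in order, then fits the final estimator
pipe.fit(texts, labels)
print(pipe.predict(["ótimo"]))
```

Fitting the whole pipeline on raw text, rather than fitting each step by hand, also makes it straightforward to cross-validate preprocessing and model together.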

## Saving the model and parameters

In [21]:
import logging_mlflow as lm

In [22]:
proj_tags = {'Cientista': 'Helder'}
mlflow_uri = 'http://web:5000'
experiment = 'sentimento_twitter_v0'


logger = lm.LogMLflow(mlflow_uri=mlflow_uri, proj_tags=proj_tags, mlflow_experiment=experiment)

In [37]:
logger.send_logs(
    model_params={
        'modelo': 'RandomForest',
        'max_depth': 10,
        'min_samples_leaf': 10,
    },
    training_metrics={
        # Weighted-average f1-score from the classification reports above
        'f1-score_train': .84,
        'f1-score_test': .83,
    }
)


Logger MLFlow. Experiment: sentimento_twitter_v0

In [27]:
import os
os.environ['AWS_ACCESS_KEY_ID'] = "minioadmin"
os.environ['AWS_SECRET_ACCESS_KEY'] = "minioadmin"

In [30]:
os.environ['MLFLOW_S3_ENDPOINT_URL'] = 'http://minio:9000'
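
The cells above hard-code the storage credentials in the notebook. A minimal sketch of an alternative, assuming the three variables are exported by the shell or container that starts Jupyter. `missing_settings` is a hypothetical helper that only reports what is absent and never prints the values.

```python
import os

REQUIRED = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "MLFLOW_S3_ENDPOINT_URL")


def missing_settings(env=os.environ, required=REQUIRED):
    """Return the names of required variables that are unset or empty."""
    return [name for name in required if not env.get(name)]


missing = missing_settings()
if missing:
    print("Set these variables before saving the model:", ", ".join(missing))
```

Checking up front gives a clear message before the model upload is attempted, instead of a storage error in the middle of `save_model`.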

In [None]:
p

In [38]:
logger.save_model(p, 'sklearn', 'analisador_sentimento')

Registered model 'analisador_sentimento' already exists. Creating a new version of this model...
2021/05/25 23:49:18 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation.                     Model name: analisador_sentimento, version 2
Created version '2' of model 'analisador_sentimento'.


In [None]:
p.predict(['Teste'])[0]