### Sentiment Analysis of Spanish Tweets with the Naive Bayes Classifier

#### Tweets retrieved from a database of Spanish-language tweets collected from users geolocated in Guatemala

### Tweet classes
* 0 = negative
* 1 = positive
* 2 = neutral

### Imports:

In [1]:
import MySQLdb
import pandas as pd
from sklearn.model_selection import train_test_split
import nltk
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from sklearn.model_selection import cross_val_score

### Retrieve data from the database:

In [36]:
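# Retrieve the labeled tweets from the database
conn = MySQLdb.connect("13.58.190.139","root","123","tesis" )
data = pd.read_sql("select * from tweets where class is not null limit 3650", conn)
# Work on an independent copy so the raw query result stays untouched
data_copy = data.copy()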
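The query keeps only rows that already carry a label. As a minimal, self-contained sketch of that filter, the snippet below uses an in-memory SQLite database as a stand-in for the MySQL server. The sample rows are hypothetical; only the `tweets` table and its `text` and `class` columns come from the notebook.

```python
import sqlite3
import pandas as pd

# In-memory SQLite stands in for the MySQL server (assumption for illustration)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (text TEXT, class INTEGER)")
# Hypothetical rows: two labeled tweets and one that has not been labeled yet
conn.executemany(
    "INSERT INTO tweets (text, class) VALUES (?, ?)",
    [("feliz martes", 1), ("estar que bonito verbo", 2), ("sin etiqueta", None)],
)
conn.commit()

# Same filter as the notebook query: unlabeled rows are skipped
data = pd.read_sql("select * from tweets where class is not null limit 3650", conn)
conn.close()
print(data)
```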

### Split data:

In [37]:
# Separate the labels (y) from the tweet text (X)
y = data_copy["class"]
X = data_copy["text"]

# Split into training and test sets (train_test_split holds out 25% by default)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
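Because no `test_size` is passed, `train_test_split` falls back to its default of holding out 25% of the rows. The sketch below reproduces the resulting set sizes on hypothetical stand-in data with the same 3650 rows, and also shows the optional `stratify` argument, which the notebook does not use but which keeps the class proportions equal in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the 3650 labeled tweets: dummy texts and cycling labels
n_rows = 3650
X = np.array([f"tweet {i}" for i in range(n_rows)])
y = np.arange(n_rows) % 3

# Default split: 25% of the rows are held out for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
print(len(X_train), len(X_test))

# Optional variant: stratify=y keeps the class proportions equal in both sets
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, random_state=42, stratify=y
)
print(np.bincount(y_test_s))
```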

### Import stop words:

In [38]:
# Load the Spanish stop-word list
spanish_stopwords = stopwords.words('spanish')
# Spanish Snowball stemmer
stemmer = SnowballStemmer('spanish')
# Tokenizer that lowercases, splits into words, and drops stop words
analyzer = CountVectorizer(stop_words = spanish_stopwords).build_analyzer()

In [39]:
# Tokenize each document, stem every token, and rejoin the stems into one string
def customized_analyzer(doc):
    stemmed_doc = []
    for text in doc:
        stems = [str(stemmer.stem(word)) for word in analyzer(text)]
        stemmed_doc.append(" ".join(stems))
    return stemmed_doc
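An alternative to pre-stemming the documents is to hand the vectorizer a callable analyzer, so that tokenization, stop-word removal, and stemming all happen inside `CountVectorizer`. This is a minimal sketch under stated assumptions: `stemming_analyzer` and `toy_stopwords` are hypothetical names, and the short stop-word list replaces the NLTK corpus only to keep the example self-contained.

```python
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer

# Short hypothetical stop-word list, used here to avoid downloading the NLTK corpus
toy_stopwords = ["me", "la", "las"]

stemmer = SnowballStemmer("spanish")
base_analyzer = CountVectorizer(stop_words=toy_stopwords).build_analyzer()

def stemming_analyzer(text):
    """Tokenize one document, drop stop words, and return the stemmed tokens."""
    return [stemmer.stem(word) for word in base_analyzer(text)]

# A callable analyzer replaces the vectorizer's own tokenization step
vectorizer = CountVectorizer(analyzer=stemming_analyzer)
docs = ["me gustan las bodas", "me gusta la boda"]
counts = vectorizer.fit_transform(docs)

print(sorted(vectorizer.vocabulary_))
print(counts.shape)
```

With this design the stemming travels with the vectorizer, so the same raw text can be passed to `fit`, `predict`, and cross-validation helpers.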


### Train and test classifier:

In [40]:
# Reload the Spanish stop-word list
spanish_stopwords = stopwords.words('spanish')

# Bag-of-words vectorizer over lowercase word unigrams, bigrams, and trigrams
vectorizer = CountVectorizer(
                analyzer = 'word',
                lowercase = True,
                ngram_range = (1,3),
                stop_words = spanish_stopwords)
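To see what `ngram_range = (1,3)` adds to the vocabulary, the sketch below runs the vectorizer's analyzer on one of the tweets from the results table. Stop words are left out here (an assumption made so that every token stays visible), which means the output differs from what the notebook's vectorizer would produce for the same tweet.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same n-gram range as the notebook; stop words are omitted to keep every token visible
ngram_analyzer = CountVectorizer(
    analyzer="word", lowercase=True, ngram_range=(1, 3)
).build_analyzer()

features = ngram_analyzer("estar que bonito verbo")
for feature in features:
    print(feature)
```

A four-word tweet yields four unigrams, three bigrams, and two trigrams, so the feature space grows quickly with the n-gram range.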

In [41]:
# Build the bag-of-words matrix from the training set
X_train_counts = vectorizer.fit_transform(X_train)

In [42]:
# Reweight the training counts with TF-IDF
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
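With its default settings, `TfidfTransformer` multiplies each count by a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, and then scales every row to unit L2 norm. The sketch below checks that formula against the library on a small hypothetical count matrix.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical count matrix: 2 documents x 3 terms
counts = np.array([[2, 1, 0],
                   [1, 0, 1]])

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of documents containing each term

# Smoothed inverse document frequency (TfidfTransformer default, smooth_idf=True)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the raw counts, then scale every row to unit L2 norm (norm='l2')
weighted = counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

sklearn_tfidf = TfidfTransformer().fit_transform(counts).toarray()
print(np.allclose(manual_tfidf, sklearn_tfidf))
```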

In [43]:
# Train the multinomial Naive Bayes classifier on the TF-IDF features
nv_classifier = MultinomialNB().fit(X_train_tfidf, y_train)
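As a rough sketch of what `MultinomialNB` computes, the snippet below rebuilds its decision rule by hand: a log prior per class plus the feature values weighted by Laplace-smoothed log likelihoods. The count matrix and labels are hypothetical; the notebook fits on TF-IDF weights instead of integer counts, which the same formula accepts.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical term counts: 4 documents x 3 terms, with two classes
X = np.array([[2, 1, 0],
              [3, 0, 0],
              [0, 1, 2],
              [0, 0, 3]])
y = np.array([0, 0, 1, 1])
alpha = 1.0  # Laplace smoothing, the MultinomialNB default

classes = np.unique(y)
# Log prior: share of training documents in each class
log_prior = np.log(np.array([(y == c).mean() for c in classes]))
# Smoothed log likelihood of each term given the class
term_totals = np.array([X[y == c].sum(axis=0) for c in classes]) + alpha
log_likelihood = np.log(term_totals / term_totals.sum(axis=1, keepdims=True))

# Score every new document against every class and pick the best one
X_new = np.array([[1, 1, 0], [0, 0, 2]])
manual_pred = classes[np.argmax(X_new @ log_likelihood.T + log_prior, axis=1)]

sklearn_pred = MultinomialNB(alpha=alpha).fit(X, y).predict(X_new)
print(manual_pred, sklearn_pred)
```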

In [44]:
# Transform the test set with the vectorizer and TF-IDF weights fitted on the training set
X_new_counts = vectorizer.transform(X_test)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

# Predict the class of each test tweet
predicted = nv_classifier.predict(X_new_tfidf)

In [45]:
# Accuracy: fraction of test tweets whose predicted class matches the label
np.mean(predicted == y_test)

0.65498357064622126
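The accuracy above is the mean of an element-wise comparison, which is the same quantity `sklearn.metrics.accuracy_score` returns. A small sketch with hypothetical labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels for eight tweets (0 = negative, 1 = positive, 2 = neutral)
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 2])
y_pred = np.array([2, 0, 1, 2, 2, 2, 2, 2])

# Mean of the element-wise comparison: True counts as 1, False as 0
manual_accuracy = np.mean(y_pred == y_true)
print(manual_accuracy, accuracy_score(y_true, y_pred))
```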

### Print results:

In [46]:
# Decode numeric labels from the predicted output into class names
def decode_predicted(predicted_value):
    predict_decode = []
    for value in predicted_value:
        if value == 0:
            predict_decode.append("Negativo")
        elif value == 1:
            predict_decode.append("Positivo")
        else:
            predict_decode.append("Neutral")
    return predict_decode
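The same decoding could also be sketched as a dictionary lookup. `LABEL_NAMES` and `decode_with_lookup` are hypothetical names; the label strings are the ones the notebook prints.

```python
# Label names as printed by the notebook
LABEL_NAMES = {0: "Negativo", 1: "Positivo"}

def decode_with_lookup(predicted_values):
    """Map numeric labels to names; anything other than 0 or 1 becomes 'Neutral'."""
    return [LABEL_NAMES.get(int(value), "Neutral") for value in predicted_values]

print(decode_with_lookup([0, 1, 2, 1]))
```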

In [47]:
# Reset the index so the test tweets are numbered from 0 (returns a DataFrame)
test_tweets = X_test.reset_index()
predict_decode = decode_predicted(predicted)
predicted_serie = pd.Series(predict_decode, index=None)

# Keep only the tweet text and attach the decoded predictions
df = pd.DataFrame(test_tweets, columns=['text'])
df2 = predicted_serie.to_frame(name='predicted')
df['predicted']=df2.values
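A shorter way to build the same two-column table, sketched on hypothetical data, is to construct the DataFrame from the raw values so that tweets and predictions align by position:

```python
import pandas as pd

# Hypothetical test tweets that keep their original row labels after the split
X_test = pd.Series(["feliz martes", "estar que bonito verbo"], index=[17, 4], name="text")
predict_decode = ["Positivo", "Neutral"]

# .values drops the old index, so rows are numbered from 0 and align by position
df = pd.DataFrame({"text": X_test.values, "predicted": predict_decode})
print(df)
```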

In [48]:
#Display results
header_style = dict(selector="th", props=[('text-align', 'left')])
pd.set_option('display.max_colwidth',140)
df.style.set_properties(**{'text-align':'left'}).set_table_styles([header_style])
df.tail(10)

| | text | predicted |
|---|---|---|
| 903 | me gustan las bodas rosa boda guatemala tu boda sonada URL | Neutral |
| 904 | estar que bonito verbo | Neutral |
| 905 | arabia saudita octava seleccion clasificada rusia2018 URL | Neutral |
| 906 | que tiernos kcacolombia valentinazeneretrendy kcaargentina valentinazenere michaelronda URL | Neutral |
| 907 | mejor me voy a mudar a r lyeh | Neutral |
| 908 | el problema de la comida rapida es que es rapida menos cuando la necesitas la hora del almuerzo URL | Neutral |
| 909 | fue de ayer y hasta ahorita la lei pinshi pamela | Neutral |
| 910 | feliz martes | Positivo |
| 911 | maria chula como marca y como nombre no deja de ser un nombre propio de origen hebreo un adjetivo que representa URL | Neutral |
| 912 | cuando acabe la residencia me voy a dedicar a las artes | Neutral |


-------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------------------
### Classification code using Pipeline:

In [49]:
from sklearn.pipeline import Pipeline
text_clf = Pipeline([('vect', CountVectorizer(stop_words=spanish_stopwords)),
                      #('tfidf', TfidfTransformer()),
                      ('clf', MultinomialNB()),])

In [50]:
text_clf.fit(customized_analyzer(X_train), y_train)  
predicted = text_clf.predict(customized_analyzer(X_test))
np.mean(predicted == y_test) 

# Accuracy NOT using stemmer function: 0.4819
# Accuracy setting n_grams range from 1-3: 0.4819

0.68236582694414016

In [51]:
# Classifier score
text_clf.score(customized_analyzer(X_test), y_test)

0.68236582694414016

In [52]:
# 5-fold cross-validation scores on the raw (unstemmed) training text
scores = cross_val_score(text_clf, X_train, y_train, cv=5)
scores

array([ 0.66302368,  0.68978102,  0.70985401,  0.66849817,  0.66117216])
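The five fold scores are easier to compare with the single test-set accuracy once they are summarized. A minimal sketch, using the fold accuracies copied from the output above:

```python
import numpy as np

# Fold accuracies copied from the cross_val_score output
scores = np.array([0.66302368, 0.68978102, 0.70985401, 0.66849817, 0.66117216])

# Mean as the headline number, standard deviation as the fold-to-fold spread
print("mean accuracy: %.4f" % scores.mean())
print("std deviation: %.4f" % scores.std())
```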

-------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------------------
### Trying other classifiers

In [53]:
# Gaussian Naive Bayes classifier using TF-IDF
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
y_pred = gnb.fit(X_train_tfidf.toarray(), y_train).predict(X_new_tfidf.toarray())
np.mean(y_pred == y_test)  

0.57721796276013149

In [54]:
# Gaussian Naive Bayes classifier using raw counts
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
y_pred = gnb.fit(X_train_counts.toarray(), y_train).predict(X_new_counts.toarray())
np.mean(y_pred == y_test)  

0.57721796276013149

In [55]:
# Bernoulli Naive Bayes classifier using TF-IDF
from sklearn.naive_bayes import BernoulliNB
clf = BernoulliNB()
clf.fit(X_train_tfidf, y_train)
BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)
y_pred =(clf.predict(X_new_tfidf))
np.mean(y_pred == y_test) 

0.64403066812705367

In [56]:
# Bernoulli Naive Bayes classifier using raw counts
from sklearn.naive_bayes import BernoulliNB
clf = BernoulliNB()
clf.fit(X_train_counts, y_train)
BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)
y_pred =(clf.predict(X_new_counts))
np.mean(y_pred == y_test) 

0.64403066812705367
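The two `BernoulliNB` runs report the same accuracy. A likely reason is the default `binarize=0.0`: every value greater than zero becomes 1 before fitting, and counts and TF-IDF weights are non-zero in exactly the same cells. The sketch below illustrates this on a hypothetical toy corpus.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import BernoulliNB

# Hypothetical toy corpus and labels
docs = ["feliz martes", "feliz feliz dia", "mal dia", "mal mal martes"]
labels = [1, 1, 0, 0]

counts = CountVectorizer().fit_transform(docs)
tfidf = TfidfTransformer().fit_transform(counts)

# binarize=0.0 (the default) turns every value > 0 into 1 before fitting
clf_counts = BernoulliNB().fit(counts, labels)
clf_tfidf = BernoulliNB().fit(tfidf, labels)

# Counts and TF-IDF weights are non-zero in the same cells, so both models match
print(np.allclose(clf_counts.feature_log_prob_, clf_tfidf.feature_log_prob_))
```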

### Notes:
* Applying TF-IDF to the text reduces accuracy.
* Applying the stemmer to the text reduces accuracy.