### Sentiment Analysis of Spanish-Language Tweets with the Naive Bayes Classifier

#### Tweets taken from a database of Spanish-language tweets collected from users geolocated in Guatemala

### Tweet classes
* 0 = negative
* 1 = positive
* 2 = neutral

### Imports:

In [14]:
import MySQLdb
import pandas as pd
from sklearn.model_selection import train_test_split
import nltk
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

In [15]:
# Retrieve the labeled tweets from the database
conn = MySQLdb.connect("13.58.190.139", "root", "123", "tesis")
data = pd.read_sql("select * from tweets where class is not null limit 2000", conn)
data_copy = data.copy()  # work on a real copy so the raw query result stays untouched

In [16]:
# Separate the labels from the tweet text
y = data_copy["class"]
X = data_copy["text"]

# Split the dataset into training and test sets (default split: 75% train, 25% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
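
With three classes, a plain random split can leave the test set with a different class balance than the training set. The sketch below shows one possible variation, not the split used in this notebook: passing `stratify` to `train_test_split` so each class keeps its proportion on both sides. The `toy_*` names and data are hypothetical and exist only for illustration.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy data: 12 placeholder tweets, 4 per class (0, 1, 2)
toy_text = ["tweet %d" % i for i in range(12)]
toy_class = [0, 1, 2] * 4

# stratify=toy_class keeps the class proportions equal in both subsets
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_text, toy_class, test_size=0.25, stratify=toy_class, random_state=42)

print(Counter(toy_y_train))  # class counts in the training subset
print(Counter(toy_y_test))   # class counts in the test subset
```

Whether stratification helps here depends on how unbalanced the labeled tweets are, which the notebook does not report.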

In [17]:
# Load the Spanish stopword list
spanish_stopwords = stopwords.words('spanish')
# Spanish Snowball stemmer
stemmer = SnowballStemmer('spanish')
# Tokenizer that lowercases, splits into words and drops stopwords
analyzer = CountVectorizer(stop_words = spanish_stopwords).build_analyzer()

# Tokenize each document, stem every token and rejoin the stems into one string
def customized_analyzer(doc):
    stemmed_doc = []
    for text in doc:
        stems = [str(stemmer.stem(word)) for word in analyzer(text)]
        stemmed_doc.append(" ".join(stems))
    return stemmed_doc
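
An alternative to pre-stemming the documents is to hand the stemming step to the vectorizer itself, since `CountVectorizer` accepts a callable as its `analyzer`. The following is a minimal sketch under stated assumptions: `toy_stopwords`, `stemmed_analyzer` and the two toy sentences are hypothetical, and a short hand-written stopword list stands in for the NLTK one so the example runs without downloading any corpus. Note that when `analyzer` is a callable, the vectorizer's own `stop_words` and `ngram_range` settings are ignored, so this sketch only produces stemmed unigrams.

```python
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer

toy_stemmer = SnowballStemmer('spanish')
toy_stopwords = ['el', 'los', 'la', 'de']  # hypothetical short list for illustration

# Word tokenizer that lowercases and removes the toy stopwords
toy_base_analyzer = CountVectorizer(stop_words=toy_stopwords).build_analyzer()

def stemmed_analyzer(doc):
    # Receives one raw document and returns its list of stemmed tokens
    return [toy_stemmer.stem(token) for token in toy_base_analyzer(doc)]

# The callable replaces the vectorizer's built-in tokenization
toy_vectorizer = CountVectorizer(analyzer=stemmed_analyzer)
toy_counts = toy_vectorizer.fit_transform(['el gato de la casa', 'los gatos'])

print(sorted(toy_vectorizer.vocabulary_))  # the stemmed vocabulary
print(toy_counts.toarray())                # one row per document
```

Because singular and plural forms collapse to the same stem, both toy sentences share a column for the "gato"/"gatos" term.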


In [18]:
# Spanish stopword list used by the bag-of-words vectorizer (word 1- to 3-grams)
spanish_stopwords = stopwords.words('spanish')

vectorizer = CountVectorizer(
                analyzer = 'word',
                lowercase = True,
                ngram_range = (1,3),
                stop_words = spanish_stopwords)

In [19]:
# Build the bag-of-words count matrix from the training set
X_train_counts = vectorizer.fit_transform(X_train)

In [20]:
# Reweight the term counts with TF-IDF
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
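
To make the effect of this step concrete, the sketch below runs the same two transformers on a few hypothetical sentences (`toy_docs` is not data from this notebook). It shows that TF-IDF keeps the shape of the count matrix and, with the default `norm='l2'`, rescales every document row to unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy documents, for illustration only
toy_docs = ["buen dia guatemala", "mal dia", "buen buen dia"]

toy_term_counts = CountVectorizer().fit_transform(toy_docs)
toy_tfidf = TfidfTransformer().fit_transform(toy_term_counts)

# Euclidean length of each document row after TF-IDF weighting
toy_row_norms = np.sqrt(
    np.asarray(toy_tfidf.multiply(toy_tfidf).sum(axis=1))).ravel()

print(toy_term_counts.shape, toy_tfidf.shape)  # same shape before and after
print(toy_row_norms)                           # each row has unit length
```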

In [21]:
# Classifier
nv_classifier = MultinomialNB().fit(X_train_tfidf, y_train)

In [22]:
# Vectorize the test set with the already fitted transformers, then predict
X_new_counts = vectorizer.transform(X_test)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

predicted = nv_classifier.predict(X_new_tfidf)

In [23]:
# Accuracy on the test set
np.mean(predicted == y_test)

0.48999999999999999
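
A single accuracy figure is hard to judge for a three-class problem without a reference point. One common check, sketched here on hypothetical labels (`toy_true` and `toy_pred` are invented and are not results from this notebook), is to compare the accuracy against always predicting the most frequent class and to look at the confusion matrix to see which classes get mixed up.

```python
from collections import Counter

import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels (0 = negative, 1 = positive, 2 = neutral)
toy_true = np.array([0, 0, 1, 2, 2, 2, 2, 1])
toy_pred = np.array([0, 2, 1, 2, 2, 2, 0, 2])

# Same accuracy computation as in the notebook
toy_accuracy = np.mean(toy_pred == toy_true)

# Baseline: accuracy of always predicting the most frequent true class
majority_label, majority_count = Counter(toy_true.tolist()).most_common(1)[0]
toy_baseline = majority_count / len(toy_true)

# Rows are true classes, columns are predicted classes
toy_cm = confusion_matrix(toy_true, toy_pred, labels=[0, 1, 2])

print(toy_accuracy, toy_baseline)
print(toy_cm)
```

If the classifier's accuracy is close to the majority baseline, the model is mostly reproducing the class distribution rather than learning from the text.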

In [24]:
# Decode the predicted labels into readable names
label_names = {0: "Negativo", 1: "Positivo"}
# Any label other than 0 or 1 is treated as neutral
predict_decode = [label_names.get(value, "Neutral") for value in predicted]

In [25]:
# Reset the index so the test tweets line up positionally with the predictions
test_tweets = X_test.reset_index()
predicted_serie = pd.Series(predict_decode, index=None)

# Build a DataFrame holding each tweet's text and its predicted label
df = pd.DataFrame(test_tweets, columns=['text'])
df2 = predicted_serie.to_frame(name='predicted')
df['predicted'] = df2.values

In [26]:
#Display results
header_style = dict(selector="th", props=[('text-align', 'left')])
pd.set_option('display.max_colwidth',140)
df.style.set_properties(**{'text-align':'left'}).set_table_styles([header_style])
df.tail(10)

| | text | predicted |
|---|---|---|
| 490 | hay gente que quita hasta el hambre uish | Negativo |
| 491 | AT USER AT USER AT USER eso decia yo solo que mejor compraran 5 o 10 radiopatrullas baratas pero funcionales | Neutral |
| 492 | quiero conocer a rk ya de ya | Neutral |
| 493 | despues de unos meses ya somos dos extranos | Negativo |
| 494 | no llegues a detestarme si desaparezco | Neutral |
| 495 | presidente en un discurso les dice a sus ministros si nos llevan a la carcel q nos lleven mal ejemp jimy a ministr URL | Neutral |
| 496 | estos son los bebes de AT USER URL | Neutral |
| 497 | senor librarme y guardarme de en medio de tanto celo | Neutral |
| 498 | nada como alguien que te hace reir | Neutral |
| 499 | si quieren pasen depositando su buenos dias con un corazon si no pues igual tengan buen dia culeros | Neutral |


### Classification code using Pipeline:

In [49]:
from sklearn.pipeline import Pipeline
# ngram_range must be a (min_n, max_n) tuple; a bare int such as (3) raises
# TypeError: 'int' object is not iterable
text_clf = Pipeline([('vect', CountVectorizer(stop_words=spanish_stopwords, ngram_range=(1, 3))),
                      ('tfidf', TfidfTransformer()),
                      ('clf', MultinomialNB())])

In [50]:
text_clf.fit(customized_analyzer(X_train), y_train)
predicted = text_clf.predict(customized_analyzer(X_test))
np.mean(predicted == y_test)
# Accuracy without the stemmer function: 0.4819
# Accuracy with an n-gram range of 1-3: 0.4819
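
The `ngram_range` argument of `CountVectorizer` expects a `(min_n, max_n)` tuple. In Python, `(3)` is just the integer 3, which is why it cannot be unpacked into two bounds. The sketch below, using hypothetical toy data (`toy_doc`, `toy_tweets` and `toy_labels` are invented for illustration), compares the vocabulary produced by `(1, 3)` and `(3, 3)` and then fits a small pipeline of the same shape as the one above.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

toy_doc = ["quiero conocer guatemala"]  # hypothetical example sentence

# (1, 3) keeps unigrams, bigrams and trigrams
vocab_1_3 = CountVectorizer(ngram_range=(1, 3)).fit(toy_doc).vocabulary_
# (3, 3) keeps trigrams only
vocab_3_3 = CountVectorizer(ngram_range=(3, 3)).fit(toy_doc).vocabulary_

print(sorted(vocab_1_3))
print(sorted(vocab_3_3))

# Hypothetical labeled toy tweets (0 = negative, 1 = positive, 2 = neutral)
toy_tweets = ["que mal dia", "que buen dia", "hoy es lunes", "muy mal todo"]
toy_labels = [0, 1, 2, 0]

toy_clf = Pipeline([('vect', CountVectorizer(ngram_range=(1, 3))),
                    ('tfidf', TfidfTransformer()),
                    ('clf', MultinomialNB())])
toy_clf.fit(toy_tweets, toy_labels)
toy_pred = toy_clf.predict(["que buen dia"])
print(toy_pred)
```

Trigram-only features are very sparse on short texts such as tweets, so a range starting at 1 is usually the safer choice to try first.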