# Indonesian Tweet Emotion Classification with Naive Bayes: MultinomialNB and ComplementNB

## Library

This project uses:
1. pandas, for data manipulation and analysis
2. numpy, to convert the data into arrays
3. re, to match patterns in strings
4. string, for string manipulation
5. Sastrawi, to remove Indonesian stopwords
6. LabelEncoder, to encode the label column as integers
7. train_test_split, to split the data into training and test sets
8. CountVectorizer and TfidfTransformer, to turn the text into TF-IDF features
9. classification_report, confusion_matrix, accuracy_score, for model evaluation

## Dataset
This project uses 4,403 Indonesian tweets, each already labeled with one of five emotions: anger, happy, sadness, fear, and love.

dataset credit: https://github.com/meisaputri21/Indonesian-Twitter-Emotion-Dataset

# Importing Libraries

In [23]:
import pandas as pd
import numpy as np
import re
import string
from Sastrawi.StopWordRemover.StopWordRemoverFactory import StopWordRemoverFactory
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

In [2]:
pd.options.display.max_colwidth = 500

In [3]:
df = pd.read_csv('Twitter_Emotion_Dataset.csv')

In [4]:
# The data is imbalanced: the number of tweets differs from label to label
df['label'].value_counts()

anger      1101
happy      1017
sadness     997
fear        649
love        637
Name: label, dtype: int64
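
To put the imbalance in numbers, the sketch below hard-codes the counts printed above and derives each label's share of the data and the ratio between the largest and smallest class. The dictionary and variable names are illustrative and not part of the notebook. The counts in this output sum to 4,401 rows.

```python
# Label counts copied from the value_counts() output above.
counts = {"anger": 1101, "happy": 1017, "sadness": 997, "fear": 649, "love": 637}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
imbalance_ratio = max(counts.values()) / min(counts.values())

for label, share in shares.items():
    print(f"{label:8s} {share:.1%}")
print("total tweets:", total)
print(f"largest / smallest class: {imbalance_ratio:.2f}")
```

The largest class (anger) has roughly 1.7 times as many tweets as the smallest (love), so the imbalance is moderate rather than extreme.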

# Label Encoding

In [5]:
le = LabelEncoder()
df['label_id'] = le.fit_transform(df['label'])
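
The integer ids that appear in the reports further down can be mapped back to label names. `LabelEncoder` assigns ids in sorted label order, so a minimal sketch of the mapping, using only the five label names from this dataset (the `label_to_id` dictionary is an illustrative helper, not part of the notebook), might look like this:

```python
from sklearn.preprocessing import LabelEncoder

labels = ["anger", "happy", "sadness", "fear", "love"]  # the five labels in the dataset

le = LabelEncoder()
le.fit(labels)

# classes_ holds the labels in sorted order; a label's position is its integer id.
label_to_id = {str(label): int(i) for i, label in enumerate(le.classes_)}
print(label_to_id)

# inverse_transform maps ids back to label names.
id_examples = [str(name) for name in le.inverse_transform([0, 2, 4])]
print(id_examples)
```

With this mapping, class 1 in the classification reports is fear and class 3 is love, the two smallest classes.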

In [6]:
df

Unnamed: 0,label,tweet,label_id
0,anger,"Soal jln Jatibaru,polisi tdk bs GERTAK gubernur .Emangny polisi tdk ikut pmbhasan? Jgn berpolitik. Pengaturan wilayah,hak gubernur. Persoalan Tn Abang soal turun temurun.Pelik.Perlu kesabaran. [USERNAME] [USERNAME] [URL]",0
1,anger,"Sesama cewe lho (kayaknya), harusnya bisa lebih rasain lah yang harus sibuk jaga diri, rasain sakitnya haid, dan paniknya pulang malem sendirian. Gimana orang asing? Wajarlah banyak korban yang takut curhat, bukan dibela malah dihujat.",0
2,happy,"Kepingin gudeg mbarek Bu hj. Amad Foto dari google, sengaja, biar teman-teman jg membayangkannya. Berbagi itu indah.",2
3,anger,"Jln Jatibaru,bagian dari wilayah Tn Abang.Pengaturan wilayah tgg jwb dan wwnang gub.Tng Abng soal rumit,sejak gub2 , trdahulu.Skrg sedng dibenahi,agr bermnfaat semua pihak.Mohon yg punya otak,berpikirlah dgn wajar,kecuali otaknya butek.Ya kamu. [URL]",0
4,happy,"Sharing pengalaman aja, kemarin jam 18.00 batalin tiket di stasiun pasar senen, lancar, antrian tidak terlalu rame,15 menitan dan beress semua! Mungkin bisa dicoba twips, di jam-jam segitu cc [USERNAME]",2
...,...,...,...
4396,love,"Tahukah kamu, bahwa saat itu papa memejamkan matanya dan menahan gejolak dan batinnya. Bahwa papa sangat ingin mengikuti keinginanmu tapu lagi-lagi dia HARUS menjagamu?",3
4397,fear,Sulitnya menetapkan Calon Wapresnya Jokowi di Pilpres 2019 salah satunya disebabkan gemuknya partai koalisi yang mengusung petahana. Sehingga sikap kehati-hatian agar tidak ada yang terluka dari partai pengusung harus tetap dijaga #Pilpres2019 #Pilpres #Jokowi #Parpol,1
4398,anger,"5. masa depannya nggak jelas. lha iya, gimana mau jelas coba? lulusan seni, bisanya cuma nari, mau kerja apa? nari-nari doang. berapa sih, penghasilannya penari? kerja juga gak tetap~ #dontdateadancer",0
4399,happy,"[USERNAME] dulu beneran ada mahasiswa Teknik UI nembak pacarnya pas sahur di Kukusan Teknik Depok, diliput kru Katakan Cinta (dan belum pacaran mereka). Sekarang udah nikah dan punya anak. Pernah diceritain laman UI Shitposting/Divarposting juga.",2


# Data Cleansing

In [8]:
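# Build the Sastrawi stopword remover once instead of on every call
stopword = StopWordRemoverFactory().create_stop_word_remover()

def cleaning(x):
    x = x.strip()                # trim leading/trailing whitespace
    x = x.lower()                # lowercase
    x = re.sub(r'\d+','',x)      # drop digits
    # delete punctuation (characters are removed, not replaced by spaces)
    x = x.translate(str.maketrans('','', string.punctuation))
    x = stopword.remove(x)       # remove Indonesian stopwords
    return x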
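
Because punctuation is deleted rather than replaced, words separated only by a comma or period get fused (for example `jatibaru,polisi` becomes `jatibarupolisi`). The sketch below compares that behavior with a variant that swaps punctuation for spaces. Both helper functions are hypothetical, leave out the Sastrawi step so the example runs without it, and are not the notebook's method.

```python
import re
import string

# Translation tables: one deletes punctuation, the other replaces it with spaces.
DELETE_PUNCT = str.maketrans('', '', string.punctuation)
SPACE_PUNCT = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

def clean_delete(text):
    """Same steps as cleaning() above, minus stopword removal."""
    text = text.strip().lower()
    text = re.sub(r'\d+', '', text)
    return text.translate(DELETE_PUNCT)

def clean_space(text):
    """Variant: punctuation becomes a space, then whitespace is collapsed."""
    text = text.strip().lower()
    text = re.sub(r'\d+', '', text)
    text = text.translate(SPACE_PUNCT)
    return re.sub(r'\s+', ' ', text).strip()

sample = "Soal jln Jatibaru,polisi tdk bs GERTAK gubernur ."
print(clean_delete(sample))
print(clean_space(sample))
```

Keeping the words separate gives `CountVectorizer` real vocabulary terms instead of one-off fused tokens, which may matter for short texts such as tweets.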

In [9]:
df['tweet'] = df['tweet'].apply(cleaning)

In [10]:
df

Unnamed: 0,label,tweet,label_id
0,anger,soal jln jatibarupolisi tdk bs gertak gubernur emangny polisi tdk ikut pmbhasan jgn berpolitik pengaturan wilayahhak gubernur persoalan tn abang soal turun temurunpelikperlu kesabaran username username url,0
1,anger,sesama cewe lho kayaknya harusnya lebih rasain lah harus sibuk jaga diri rasain sakitnya haid paniknya pulang malem sendirian gimana orang asing wajarlah banyak korban takut curhat bukan dibela malah dihujat,0
2,happy,kepingin gudeg mbarek bu hj amad foto google sengaja biar temanteman jg membayangkannya berbagi indah,2
3,anger,jln jatibarubagian wilayah tn abangpengaturan wilayah tgg jwb wwnang gubtng abng soal rumitsejak gub trdahuluskrg sedng dibenahiagr bermnfaat semua pihakmohon yg punya otakberpikirlah dgn wajarkecuali otaknya butekya kamu url,0
4,happy,sharing pengalaman aja kemarin jam batalin tiket stasiun pasar senen lancar antrian terlalu rame menitan beress semua mungkin dicoba twips jamjam segitu cc username,2
...,...,...,...
4396,love,tahukah kamu saat papa memejamkan matanya menahan gejolak batinnya papa sangat mengikuti keinginanmu tapu lagilagi harus menjagamu,3
4397,fear,sulitnya menetapkan calon wapresnya jokowi pilpres salah satunya disebabkan gemuknya partai koalisi mengusung petahana sikap kehatihatian tidak yang terluka partai pengusung tetap dijaga pilpres pilpres jokowi parpol,1
4398,anger,masa depannya jelas lha iya gimana mau jelas coba lulusan seni bisanya cuma nari mau kerja apa narinari doang berapa sih penghasilannya penari kerja gak tetap dontdateadancer,0
4399,happy,username dulu beneran mahasiswa teknik ui nembak pacarnya pas sahur kukusan teknik depok diliput kru katakan cinta belum pacaran sekarang udah nikah punya anak pernah diceritain laman ui shitpostingdivarposting,2


# Split Feature and Target

In [13]:
X = df['tweet'].tolist()
y = df['label_id'].to_numpy()

# Splitting the Data into Train and Test Sets

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=42, test_size=0.3)
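
The split above is random, so each label's share can drift slightly between the train and test sets. With imbalanced labels, one option is the `stratify` argument of `train_test_split`. The sketch below is an illustration only, not what the notebook ran: it rebuilds a synthetic label array from the class counts shown earlier and checks that a stratified 70/30 split keeps the class shares.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic label ids with the same class counts as the dataset
# (0=anger, 1=fear, 2=happy, 3=love, 4=sadness).
label_ids = np.repeat([0, 1, 2, 3, 4], [1101, 649, 1017, 637, 997])
row_ids = np.arange(len(label_ids))  # stand-in for the tweets

_, _, y_tr, y_te = train_test_split(
    row_ids, label_ids, test_size=0.3, random_state=42, stratify=label_ids
)

overall_share = np.bincount(label_ids) / len(label_ids)
test_share = np.bincount(y_te) / len(y_te)

print(len(y_tr), len(y_te))
print(np.round(overall_share, 3))
print(np.round(test_share, 3))
```

The train and test sizes come out as 3080 and 1321, the same totals that appear in the `support` column of the reports below.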

# Naive Bayes - MultinomialNB and ComplementNB

In [16]:
from sklearn.naive_bayes import MultinomialNB, ComplementNB

# Modelling : MultinomialNB

## Data Train

In [17]:
count_vec = CountVectorizer()
X_train_c = count_vec.fit_transform(X_train)

tfidf = TfidfTransformer()
X_train_tfidf = tfidf.fit_transform(X_train_c)
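
`CountVectorizer` followed by `TfidfTransformer` is a two-step route to TF-IDF features; scikit-learn's `TfidfVectorizer` combines both steps. The sketch below checks on three hypothetical toy tweets that the two routes give the same matrix when default settings are used.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy tweets, already cleaned.
docs = ["soal jln jatibaru", "berbagi itu indah", "soal turun temurun"]

# Two-step route used in this notebook: raw counts, then TF-IDF weighting.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route.
one_step = TfidfVectorizer().fit_transform(docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(counts.shape)  # (documents, vocabulary terms)
print(same)
```

Either route works; the two-step form makes it easy to inspect the raw counts before weighting.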

In [19]:
# MNB
mnb_model = MultinomialNB().fit(X_train_tfidf,y_train)

In [20]:
mnb_model

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
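
The `alpha=1.0` in the output above is the additive (Laplace) smoothing parameter: every term gets a pseudo-count of `alpha` in every class, so a term unseen in a class never receives zero probability. A minimal sketch on hypothetical toy counts, comparing the hand-computed estimate with the fitted `feature_log_prob_`:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy counts: 4 documents, 3 vocabulary terms, 2 classes.
X_toy = np.array([[2, 1, 0], [1, 0, 0], [0, 1, 3], [0, 0, 2]])
y_toy = np.array([0, 0, 1, 1])

alpha = 1.0
model = MultinomialNB(alpha=alpha).fit(X_toy, y_toy)

# Per-class term totals, then additive smoothing:
# P(term | class) = (count + alpha) / (class total + alpha * vocabulary size)
class_counts = np.array([X_toy[y_toy == c].sum(axis=0) for c in model.classes_])
n_terms = X_toy.shape[1]
smoothed = (class_counts + alpha) / (
    class_counts.sum(axis=1, keepdims=True) + alpha * n_terms
)

matches = np.allclose(np.log(smoothed), model.feature_log_prob_)
print(smoothed)
print(matches)
```

In the notebook the inputs are TF-IDF weights rather than integer counts; the same formula applies with fractional "counts".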

In [22]:
y_train_pred = mnb_model.predict(X_train_tfidf)

In [27]:
print('classification report')
print(classification_report(y_train,y_train_pred))
print('\n')
print('confusion matrix')
print(confusion_matrix(y_train,y_train_pred))
print('\n')
print('accuracy score')
print(accuracy_score(y_train,y_train_pred))

classification report
              precision    recall  f1-score   support

           0       0.84      0.99      0.91       751
           1       0.99      0.71      0.83       468
           2       0.93      0.95      0.94       710
           3       0.99      0.75      0.85       446
           4       0.85      0.96      0.90       705

    accuracy                           0.90      3080
   macro avg       0.92      0.87      0.89      3080
weighted avg       0.91      0.90      0.89      3080



confusion matrix
[[743   1   2   0   5]
 [ 66 334  13   2  53]
 [ 17   1 675   2  15]
 [ 27   0  35 333  51]
 [ 27   0   0   1 677]]


accuracy score
0.8967532467532467
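
The numbers in the report follow directly from the confusion matrix: recall divides each diagonal entry by its row sum, precision divides it by its column sum, and accuracy is the diagonal total over all samples. The sketch below redoes that arithmetic on the matrix printed above (variable names are illustrative).

```python
import numpy as np

# MultinomialNB training confusion matrix, copied from the output above
# (rows = true label id, columns = predicted label id).
cm = np.array([
    [743,   1,   2,   0,   5],
    [ 66, 334,  13,   2,  53],
    [ 17,   1, 675,   2,  15],
    [ 27,   0,  35, 333,  51],
    [ 27,   0,   0,   1, 677],
])

correct = np.diag(cm)                  # correctly classified tweets per class
recall = correct / cm.sum(axis=1)      # share of each true class that was found
precision = correct / cm.sum(axis=0)   # share of each predicted class that was right
accuracy = correct.sum() / cm.sum()

print(np.round(precision, 2))
print(np.round(recall, 2))
print(accuracy)
```

The low recall for classes 1 (fear) and 3 (love) shows up in the matrix as 66 and 27 tweets from those rows landing in column 0 (anger), the largest class.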


## Data Test

In [28]:
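# This refits the vectorizer and the TF-IDF weights on the test tweets,
# so the test features are built from the test vocabulary, not the training one.
count_vec = CountVectorizer()
X_test_c = count_vec.fit_transform(X_test)

tfidf = TfidfTransformer()
X_test_tfidf = tfidf.fit_transform(X_test_c)

In [29]:
# This fits a new model on the test split and replaces the model trained above.
mnb_model = MultinomialNB().fit(X_test_tfidf,y_test)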

In [30]:
mnb_model

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [31]:
y_test_pred = mnb_model.predict(X_test_tfidf)

In [32]:
print('classification report')
print(classification_report(y_test,y_test_pred))
print('\n')
print('confusion matrix')
print(confusion_matrix(y_test,y_test_pred))
print('\n')
print('accuracy score')
print(accuracy_score(y_test,y_test_pred))

classification report
              precision    recall  f1-score   support

           0       0.77      1.00      0.87       350
           1       0.99      0.55      0.70       181
           2       0.98      0.95      0.97       307
           3       0.98      0.81      0.89       191
           4       0.89      0.95      0.92       292

    accuracy                           0.89      1321
   macro avg       0.92      0.85      0.87      1321
weighted avg       0.90      0.89      0.88      1321



confusion matrix
[[349   1   0   0   0]
 [ 59  99   4   3  16]
 [ 13   0 293   0   1]
 [ 16   0   3 154  18]
 [ 16   0   0   0 276]]


accuracy score
0.886449659348978
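
The cells in this section call `fit_transform` and `fit` on the test split, so the score above describes a model fitted on the test tweets. A held-out evaluation would instead fit the vectorizer, the TF-IDF weights, and the classifier on the training split only, and then just transform and predict the test split. A minimal sketch of that pattern with a scikit-learn `Pipeline`, on hypothetical toy tweets (the step names and data are assumptions, and no result for the real dataset is implied):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy tweets standing in for X_train / X_test.
train_docs = ["marah sekali hari ini", "senang sekali hari ini",
              "marah dan kesal", "senang dan bahagia"]
train_labels = [0, 1, 0, 1]
test_docs = ["kesal sekali", "bahagia hari ini"]
test_labels = [0, 1]

pipe = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("nb", MultinomialNB()),
])

# Vocabulary, IDF weights, and model parameters all come from the training split.
pipe.fit(train_docs, train_labels)

# The test tweets are only transformed and predicted, never fitted on.
test_pred = pipe.predict(test_docs)

n_train_terms = len(pipe.named_steps["counts"].vocabulary_)
n_test_columns = pipe.named_steps["counts"].transform(test_docs).shape[1]
print(n_train_terms, n_test_columns)
print(accuracy_score(test_labels, test_pred))
```

Bundling the steps in one pipeline makes it hard to refit any of them on the test split by accident, and the test matrix always has the same columns as the training matrix.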


# Modelling : ComplementNB

## Data Train

In [33]:
count_vec = CountVectorizer()
X_train_c = count_vec.fit_transform(X_train)

tfidf = TfidfTransformer()
X_train_tfidf = tfidf.fit_transform(X_train_c)

In [34]:
cnb_model = ComplementNB().fit(X_train_tfidf,y_train)

In [35]:
y_train_pred = cnb_model.predict(X_train_tfidf)

In [36]:
print('classification report')
print(classification_report(y_train,y_train_pred))
print('\n')
print('confusion matrix')
print(confusion_matrix(y_train,y_train_pred))
print('\n')
print('accuracy score')
print(accuracy_score(y_train,y_train_pred))

classification report
              precision    recall  f1-score   support

           0       0.96      0.99      0.98       751
           1       0.98      0.98      0.98       468
           2       0.99      0.97      0.98       710
           3       0.96      0.98      0.97       446
           4       0.98      0.96      0.97       705

    accuracy                           0.98      3080
   macro avg       0.98      0.98      0.98      3080
weighted avg       0.98      0.98      0.98      3080



confusion matrix
[[746   2   1   0   2]
 [  2 457   3   3   3]
 [ 10   2 686   6   6]
 [  4   4   2 435   1]
 [ 16   1   0   8 680]]


accuracy score
0.9753246753246754


## Data Test

In [37]:
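# This refits the vectorizer and the TF-IDF weights on the test tweets,
# so the test features are built from the test vocabulary, not the training one.
count_vec = CountVectorizer()
X_test_c = count_vec.fit_transform(X_test)

tfidf = TfidfTransformer()
X_test_tfidf = tfidf.fit_transform(X_test_c)

In [38]:
# This fits a new model on the test split and replaces the model trained above.
cnb_model = ComplementNB().fit(X_test_tfidf,y_test)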

In [39]:
y_test_pred = cnb_model.predict(X_test_tfidf)

In [40]:
print('classification report')
print(classification_report(y_test,y_test_pred))
print('\n')
print('confusion matrix')
print(confusion_matrix(y_test,y_test_pred))
print('\n')
print('accuracy score')
print(accuracy_score(y_test,y_test_pred))

classification report
              precision    recall  f1-score   support

           0       0.98      0.99      0.99       350
           1       0.99      0.98      0.99       181
           2       1.00      0.99      0.99       307
           3       0.98      1.00      0.99       191
           4       1.00      0.98      0.99       292

    accuracy                           0.99      1321
   macro avg       0.99      0.99      0.99      1321
weighted avg       0.99      0.99      0.99      1321



confusion matrix
[[348   2   0   0   0]
 [  1 178   0   1   1]
 [  3   0 303   1   0]
 [  0   0   0 191   0]
 [  4   0   1   2 285]]


accuracy score
0.987887963663891


## Conclusion

1. MultinomialNB
    - Data Train accuracy : 0.89675
    - Data Test accuracy : 0.88645

2. ComplementNB
    - Data Train accuracy : 0.97532
    - Data Test accuracy : 0.98789

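Based on the classification reports, both Naive Bayes variants performed well, and ComplementNB scored higher than MultinomialNB on both splits (about 0.975 vs. 0.897 on the training data and 0.988 vs. 0.886 on the test data). The gap between train and test accuracy is small for both models. Note that in the Data Test sections the vectorizer, the TF-IDF weights, and the model were all refit on the test split, so the test figures show how well each model fits the test tweets rather than how well a model trained on the training split generalizes to unseen tweets. ComplementNB is an adaptation of the standard Multinomial Naive Bayes algorithm that is particularly suited to imbalanced data such as this dataset.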
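
To make the difference between the two variants concrete: ComplementNB estimates each class's term weights from all documents that do not belong to that class, which gives small classes more stable estimates than counting only their own few documents. The sketch below reproduces that computation by hand on hypothetical toy counts and compares it with scikit-learn's `ComplementNB` under its default settings (`norm=False`). The toy data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.naive_bayes import ComplementNB

# Hypothetical toy counts: 5 documents, 3 terms, imbalanced classes (4 vs 1).
X_toy = np.array([[3, 0, 1], [2, 1, 0], [4, 0, 0], [1, 1, 0], [0, 2, 3]])
y_toy = np.array([0, 0, 0, 0, 1])

alpha = 1.0
model = ComplementNB(alpha=alpha).fit(X_toy, y_toy)

# For each class c, count every term in all documents NOT in c, then smooth.
comp_counts = np.array(
    [X_toy[y_toy != c].sum(axis=0) for c in model.classes_]
) + alpha
comp_prob = comp_counts / comp_counts.sum(axis=1, keepdims=True)

# A term that is common outside class c is evidence against c, hence the minus sign.
weights = -np.log(comp_prob)

# Score each document against each class and pick the highest score.
scores = X_toy @ weights.T
manual_pred = model.classes_[scores.argmax(axis=1)]

print(np.allclose(weights, model.feature_log_prob_))
print(manual_pred, model.predict(X_toy))
```

Here the minority class 1 has a single document, yet its weights are estimated from the four documents of class 0, which is the property that makes the method attractive for imbalanced label distributions.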