# Sentiment Analysis on Twitter tweets using LSTM and Keras
<hr>

### Steps
<ol type="1">
    <li>Load the dataset (13k twitter tweets with manually marked label)</li>
    <li>Clean Dataset</li>
    <li>Encode Sentiments</li>
    <li>Split Dataset</li>
    <li>Tokenize and Pad/Truncate Tweets</li>
    <li>Build Architecture/Model</li>
    <li>Train and Test</li>
</ol>

<hr>
<i>Import all the libraries needed</i>

In [12]:
!pip install Sastrawi

Collecting Sastrawi
  Downloading Sastrawi-1.0.1-py2.py3-none-any.whl (209 kB)
Installing collected packages: Sastrawi
Successfully installed Sastrawi-1.0.1


In [1]:
import pandas as pd    # to load dataset
import numpy as np     # for mathematic equation
from nltk.corpus import stopwords   # to get collection of stopwords
from sklearn.model_selection import train_test_split       # for splitting dataset
from tensorflow.keras.preprocessing.text import Tokenizer  # to encode text to int
from tensorflow.keras.preprocessing.sequence import pad_sequences   # to do padding or truncating
from tensorflow.keras.models import Sequential     # the model
from tensorflow.keras.layers import Embedding, LSTM, Dense # layers of the architecture
from tensorflow.keras.callbacks import ModelCheckpoint   # save model
from tensorflow.keras.models import load_model   # load saved model
import re, io, json
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory # Indonesian Stemmer

<hr>
<i>Preview dataset</i>

In [2]:
data = pd.read_csv('utf8_dataset.csv')

print(data[['Tweet', 'HS']])

                                                   Tweet  HS
0      - disaat semua cowok berusaha melacak perhatia...   1
1      RT USER: USER siapa yang telat ngasih tau elu?...   0
2      41. Kadang aku berfikir, kenapa aku tetap perc...   0
3      USER USER AKU ITU AKU\n\nKU TAU MATAMU SIPIT T...   0
4      USER USER Kaum cebong kapir udah keliatan dong...   1
...                                                  ...  ..
13164  USER jangan asal ngomong ndasmu. congor lu yg ...   1
13165                       USER Kasur mana enak kunyuk'   0
13166  USER Hati hati bisu :( .g\n\nlagi bosan huft \...   0
13167  USER USER USER USER Bom yang real mudah terdet...   0
13168  USER Mana situ ngasih(": itu cuma foto ya kuti...   1

[13169 rows x 2 columns]


<hr>
<b>Stop Word</b> is a commonly used words in a sentence, usually a search engine is programmed to ignore this words (i.e. "the", "a", "an", "of", etc.)

<i>Declaring the Indonesian stop words</i>

In [3]:
indonesian_stopwords = pd.read_csv('stopwords.txt', sep="\n")
indonesian_stopwords = indonesian_stopwords.iloc[:, 0].values.tolist()
indonesian_stopwords

['adalah',
 'adanya',
 'adapun',
 'agak',
 'agaknya',
 'agar',
 'akan',
 'akankah',
 'akhir',
 'akhiri',
 'akhirnya',
 'aku',
 'akulah',
 'amat',
 'amatlah',
 'anda',
 'andalah',
 'antar',
 'antara',
 'antaranya',
 'apa',
 'apaan',
 'apabila',
 'apakah',
 'apalagi',
 'apatah',
 'artinya',
 'asal',
 'asalkan',
 'atas',
 'atau',
 'ataukah',
 'ataupun',
 'awal',
 'awalnya',
 'bagai',
 'bagaikan',
 'bagaimana',
 'bagaimanakah',
 'bagaimanapun',
 'bagi',
 'bagian',
 'bahkan',
 'bahwa',
 'bahwasanya',
 'baik',
 'bakal',
 'bakalan',
 'balik',
 'banyak',
 'bapak',
 'baru',
 'bawah',
 'beberapa',
 'begini',
 'beginian',
 'beginikah',
 'beginilah',
 'begitu',
 'begitukah',
 'begitulah',
 'begitupun',
 'bekerja',
 'belakang',
 'belakangan',
 'belum',
 'belumlah',
 'benar',
 'benarkah',
 'benarlah',
 'berada',
 'berakhir',
 'berakhirlah',
 'berakhirnya',
 'berapa',
 'berapakah',
 'berapalah',
 'berapapun',
 'berarti',
 'berawal',
 'berbagai',
 'berdatangan',
 'beri',
 'berikan',
 'berikut',
 'beri

<hr>

### Load and Clean Dataset

In the original dataset, the tweets are still dirty. There are still html tags, numbers, uppercase, and punctuations. This will not be good for training, so in <b>load_dataset()</b> function, beside loading the dataset using <b>pandas</b>, I also pre-process the tweets by removing html tags, non alphabet (punctuations and numbers), stop words, and lower case all of the tweets.

### Encode Sentiments
In the same function, I also encode the sentiments into integers (0 and 1). Where 0 is for negative sentiments and 1 is for positive sentiments.

In [None]:
def stemmer(text):
    # Init indonesian stemmer
    factory = StemmerFactory()
    s = factory.create_stemmer()
    result = s.stem(text)
    print(result)
    

def load_dataset():
    df = pd.read_csv('utf8_dataset.csv')
    df.replace(to_replace=[r"\\t|\\n|\\r", "\t|\n|\r"], value=[" "," "], regex=True, inplace=True)

    x_data = df['Tweet']       # Tweets/Input
    y_data = df['HS']    # Sentiment/Output

    # PRE-PROCESS TWEETS
    x_data = x_data.apply(lambda tweet: tweet.lower())   # lower case
    x_data = x_data.replace({'<.*?>': ''}, regex = True)          # remove html tag
    x_data = x_data.replace({'[^A-Za-z]': ' '}, regex = True)     # remove non alphabet
    x_data = x_data.apply(lambda tweet: ' '.join([w for w in tweet.split() if len(w) > 2]))  # remove words that is lees than 2 chars
    x_data = x_data.str.replace('rt', '') # Remove RT
    x_data = x_data.str.replace('user', '') # Remove USER
    x_data = x_data.apply(lambda tweet: ' '.join(tweet.split()))   # Remove excess spaces
    x_data = x_data.str.strip() # Trim
    
    x_data = x_data.apply(lambda tweet: stemmer(tweet))   # stem
    
#     x_data = x_data.apply(lambda tweet: ' '.join([w for w in tweet.split() if w not in indonesian_stopwords]))  # remove stop words
    
    # ENCODE SENTIMENT -> 0 & 1
    y_data = y_data.replace(1, 1)
    y_data = y_data.replace(0, 0)

    return x_data, y_data

x_data, y_data = load_dataset()

print('Tweet')
print(x_data, '\n')
print('HS')
print(y_data)


saat semua cowok usaha lacak perhati gue loe lantas remeh perhati gue kasih khusus elo basic elo cowok bego
siapa yang telat ngasih tau elu edan sarap gue gaul dengan cigax jifla cal sama siapa noh licew juga
kadang aku berfikir kenapa aku tetap percaya pada tuhan padahal aku selalu jatuh kali kali kadang aku rasa tuhan itu ninggalkan aku sendiri ketika orangtuaku rencana pisah ketika kakak lebih pilih jadi kristen ketika aku anak ter
aku itu aku tau mata sipit tapi liat dari mana itu aku
kaum cebong kapir udah liat dongok dari awal tambah dongok lagi hahahah
bani taplak dkk
deklarasi pilkada aman dan anti hoax warga dukuh sari jabon
gue baru aja kelar watch aldnoah zero paling kampret emang endingnya karakter utama cowonya kena friendzone bray url
nah admin belanja satu lagi po baik nak makan ais kepal milo ais kepal horlicks atau cendol toping kaw kaw doket mano gerai rojak meuaku taipan depan twins baby amp romantika bank islam senawang
enak klo smbil ngewe
tidak punya jari tengah b

prabowo sudah kalah sebut bantu jokowi hanya citra adalah ratap pilu
dan yang takut dengan adzan adalah iblis
goblog itu adalah bani cebong tukang tipu jilat kuasa tahu gerakin masa bayar pakai nasi bungkus propaganda nasi bungkus memang selalu gagal
wuih cebong sewot
gera ini tekan penting kerja keras cara total untuk tingkat potensi bangsa
padahal gubernur saat ini djarot mayoritas paai politik dprd juga dukung ahok djarot payah mereka url
rezim rusak bukan baik systeamnya malah sibuk cari salah klu itu pasti ketemu aja kpk alat besok perintah ganti ulang kembali engah pernah selesai siapa organisasi orang
saat orang orang saling tuding antek aseng aseng sungguh beepuk tangan
nah loh kata anti aseng
selamat pak moga bisa terus kiprah bahkan tingkat nasional
asyikm nang saya sih ikut ulama saja
yang saya tahu bahwa ketapang banyak tka asing kerja bagai usaha masuk who tambang bouksid viral saat pilpres dulu dan bisa jadi ini karena kerja sama tsb jadi era gubernur cornelis dan wakil o

jangan anut bumi datar juga soal teori bumi bulat kan dari orang asing notabane musuh mereka
becak warga binjai hendak sita karena guna tenda beuliskan gantipresiden pks radang bukti perintah panik dan takut
lebih baik jokowi mundur
munas alim ulama ppp semarang hasil lima rekomendasi syarat dan kriteria kriteria bagi calon wakil presiden yang akan damping bpk jokowi datang romahurmuziy
luar biasa cerah nya saya rasa gagal bagai anak ekonomi wkwk
kafan asal dari budaya yahudi kenapa umat islam klaim
oleh rezim skrg atas kendali negara cina komunis mjd patron tunggal rezim anti islam jokowi
itu tanda nya mau pilkada pilpres
misi bangun masyarakat yang gotong royong toleran dan harmoni selaras nilai budaya jawa barat yang luhur jabarasyik asyikm nang
tabah pasang taufiq dan puan nurliyana yang wajar contoh jom kita beri sumbang untuk ringan beban pasang ini come twitter your things bsn bank islam nurliyana bte abdul raziff
apaansi tai baru doang belagunya udah langit inget ada karma ntar

kok pake buruh cina kan rakyat sendiri bnyak pngangguran dasar rejim kodok

haneul pada cute padahal rlnya perek yang retweet
bilang anti asing aseng nda tau wkwkwkwkwk
suriah sekolah radikal dan rakit bom plg indonesia jadi tempat pkl aman
karena kadang hidup emang kayak tai malah kadang yang berak orang lain tapi elo yang harus resin tainya tangan
banci selotipan
haha kalah yee org jahat pasti kalah
gubernurzamannow gusipulputi ganjaryasin djarotsihar hasananton kosterace nurdinsudirman karolingidot keanekaragaman budaya suku agama itu sungguh kaya kita tutur djarot hadap pesea rakercabsu
cinta orba pasti meng idola kan prabowo amp hary tanoe
nantang buka dada nyebut setan emang sudah pada habis kamus ngelawan kenegarawanan pak joko widodo presiden pada makin stresssssss nie hati kena stoke atau serang jantung nanti rakyat indonesia hilang badut politik merdeka
belah duren yaaaa gak dungu gak cebong maaf
orang orang yahudi dan nasrani yang telah kami beri kitab taurat dan injil kenal

edan yang mas newbie menang saja liat yampun kau aja mas dikit bat
nah klo pki alias paai korupsi indonesia perlu ganyang juga
perempuan kaya mending mati aja deh jelek aja gausa sok jadi make aist
jual lagi gua kaya jablay
yang bijak cuma tugas dan jajar krl doang para tumpang cuma followers doang xef nyuruh orng utk mengei sendiri mau mengei butuh banyak orng kan bangke
kyknya itu titit nya sambung trus pake accessoris
mata pas kecil sipit skrg malah engga aneh pis
ingt aku buat banci gigih kira url
waspada cina komunis yang ada indonesia
banci kampung heeee
gembel ganteng perlu kasihan karena byk cinta asik asik jos
dekalarasi pilkada aman dan anti hoax toga jemirahan jabon
setiausaha agung pas datuk takiyuddin hassan sahkan ada pihak cuba pinda lembaga pai ini pei yang kata presiden pai belum ini
emang aku najis
sastra karl mark pentol komunis rusia
otak mereka kayak bener bener bool deh mess trus tiap harinua makan silit jadi gtu
ganti rezim kacung lah cebong mah gitu kalo tanya
c

akun siti fatimah amp komen laman diringkes semua aja parah amp bejat malu moga sehat selalu tetap sabar amp semangat terus suara benar
bawang putih itu sdg kafir haram lagi hiks hiks hiks
tuju ahokers lengser prabon jokowi daripada ancurin penjara karuan url
rawan join adalah buah gera arus bawah rindu pimpin nasional dari nasionalis islam moderat cakiminthenextwapres join
onta nsd
umpan pake apa biar mancing dpt buaya
komunis kok berabtad pki kecoak makan kecoak dong
retweeted ikko pribumi kawe horas horas horas dengan guna kapal feri presiden unjung pulau samosir dlm rangkai kunjung kerja presiden sambut dgn upacara adat samosir pekik horas horas horas
mata loee picek janji jokowi kok tagih sby mata picek blok
yang dua indonesia akan beahan bahkan bisa jadi kuat ekonomi politik baru jika hun pilih bekerjasama dengan simpan beda tidak penting macam agama
sudah lebih dari tahun freepo kuasa oleh pihak asing baru masa perintah jokowi sea pak jonan jadi tri esdm freepo kita ambil alih s

<hr>

### Split Dataset
In this work, I decided to split the data into 80% of Training and 20% of Testing set using <b>train_test_split</b> method from Scikit-Learn. By using this method, it automatically shuffles the dataset. We need to shuffle the data because in the original dataset, the tweets and sentiments are in order, where they list positive tweets first and then negative tweets. By shuffling the data, it will be distributed equally in the model, so it will be more accurate for predictions.

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size = 0.2)

print('Train Set')
print(x_train, '\n')
print(x_test, '\n')
print('Test Set')
print(y_train, '\n')
print(y_test)

<hr>
<i>Function for getting the maximum tweet length, by calculating the mean of all the tweets length (using <b>numpy.mean</b>)</i>

In [6]:
def get_max_length():
    tweet_length = []
    for tweet in x_train:
        tweet_length.append(len(tweet))

    return int(np.ceil(np.mean(tweet_length)))

print(get_max_length())

75


<hr>

### Tokenize and Pad/Truncate Tweets
A Neural Network only accepts numeric data, so we need to encode the tweets. I use <b>tensorflow.keras.preprocessing.text.Tokenizer</b> to encode the tweets into integers, where each unique word is automatically indexed (using <b>fit_on_texts</b> method) based on <b>x_train</b>. <br>
<b>x_train</b> and <b>x_test</b> is converted into integers using <b>texts_to_sequences</b> method.

Each tweets has a different length, so we need to add padding (by adding 0) or truncating the words to the same length (in this case, it is the mean of all tweets length) using <b>tensorflow.keras.preprocessing.sequence.pad_sequences</b>.


<b>post</b>, pad or truncate the words in the back of a sentence<br>
<b>pre</b>, pad or truncate the words in front of a sentence

In [7]:
# ENCODE TWEETS
token = Tokenizer(lower=False)    # no need lower, because already lowered the data in load_data()
token.fit_on_texts(x_train)
x_train = token.texts_to_sequences(x_train)
x_test = token.texts_to_sequences(x_test)

max_length = get_max_length()

x_train = pad_sequences(x_train, maxlen=max_length, padding='post', truncating='post')
x_test = pad_sequences(x_test, maxlen=max_length, padding='post', truncating='post')

total_words = len(token.word_index) + 1   # add 1 because of 0 padding

print('Encoded X Train\n', x_train, '\n')
print('Encoded X Test\n', x_test, '\n')
print('Maximum tweets length: ', max_length)

Encoded X Train
 [[ 942 6445 3907 ... 2244    0    0]
 [  24 1680 4906 ...  251  181 1042]
 [ 274  162  165 ...   21   87    0]
 ...
 [ 302    0    0 ...    0    0    0]
 [ 333  748 1985 ...    0    0    0]
 [ 275  170  563 ... 3113  847 9234]] 

Encoded X Test
 [[1593   11  980 ... 3663   75 3243]
 [1142  204 7021 ...  703   21  703]
 [ 511   21  262 ...   20  814 1674]
 ...
 [ 370    6 2725 ... 4991  794 5529]
 [ 325   26  600 ... 4660  328  757]
 [ 113 6536 4225 ... 1680   86  373]] 

Maximum tweets length:  12


<hr>

### Build Architecture/Model
<b>Embedding Layer</b>: in simple terms, it creates word vectors of each word in the <i>word_index</i> and group words that are related or have similar meaning by analyzing other words around them.

<b>LSTM Layer</b>: to make a decision to keep or throw away data by considering the current input, previous output, and previous memory. There are some important components in LSTM.
<ul>
    <li><b>Forget Gate</b>, decides information is to be kept or thrown away</li>
    <li><b>Input Gate</b>, updates cell state by passing previous output and current input into sigmoid activation function</li>
    <li><b>Cell State</b>, calculate new cell state, it is multiplied by forget vector (drop value if multiplied by a near 0), add it with the output from input gate to update the cell state value.</li>
    <li><b>Ouput Gate</b>, decides the next hidden state and used for predictions</li>
</ul>

<b>Dense Layer</b>: compute the input with the weight matrix and bias (optional), and using an activation function. I use <b>Sigmoid</b> activation function for this work because the output is only 0 or 1.

The optimizer is <b>Adam</b> and the loss function is <b>Binary Crossentropy</b> because again the output is only 0 and 1, which is a binary number.

In [8]:
# ARCHITECTURE
EMBED_DIM = 32
LSTM_OUT = 64

model = Sequential()
model.add(Embedding(total_words, EMBED_DIM, input_length = max_length))
model.add(LSTM(LSTM_OUT))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

print(model.summary())

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 12, 32)            745920    
                                                                 
 lstm (LSTM)                 (None, 64)                24832     
                                                                 
 dense (Dense)               (None, 1)                 65        
                                                                 
Total params: 770,817
Trainable params: 770,817
Non-trainable params: 0
_________________________________________________________________
None


<hr>

### Training
For training, it is simple. We only need to fit our <b>x_train</b> (input) and <b>y_train</b> (output/label) data. For this training, I use a mini-batch learning method with a <b>batch_size</b> of <i>128</i> and <i>5</i> <b>epochs</b>.

Also, I added a callback called **checkpoint** to save the model locally for every epoch if its accuracy improved from the previous epoch.

In [9]:
checkpoint = ModelCheckpoint(
    'models/LSTM.h5',
    monitor='accuracy',
    save_best_only=True,
    verbose=1
)

In [10]:
model.fit(x_train, y_train, batch_size = 128, epochs = 10, callbacks=[checkpoint])

Epoch 1/10
Epoch 00001: accuracy improved from -inf to 0.67252, saving model to models\LSTM.h5
Epoch 2/10
Epoch 00002: accuracy improved from 0.67252 to 0.88818, saving model to models\LSTM.h5
Epoch 3/10
Epoch 00003: accuracy improved from 0.88818 to 0.95017, saving model to models\LSTM.h5
Epoch 4/10
Epoch 00004: accuracy improved from 0.95017 to 0.97447, saving model to models\LSTM.h5
Epoch 5/10
Epoch 00005: accuracy improved from 0.97447 to 0.98234, saving model to models\LSTM.h5
Epoch 6/10
Epoch 00006: accuracy improved from 0.98234 to 0.98813, saving model to models\LSTM.h5
Epoch 7/10
Epoch 00007: accuracy improved from 0.98813 to 0.99022, saving model to models\LSTM.h5
Epoch 8/10
Epoch 00008: accuracy improved from 0.99022 to 0.99136, saving model to models\LSTM.h5
Epoch 9/10
Epoch 00009: accuracy improved from 0.99136 to 0.99298, saving model to models\LSTM.h5
Epoch 10/10
Epoch 00010: accuracy improved from 0.99298 to 0.99421, saving model to models\LSTM.h5


<keras.callbacks.History at 0x26894bbbc10>

<hr>

### Testing
To evaluate the model, we need to predict the sentiment using our <b>x_test</b> data and comparing the predictions with <b>y_test</b> (expected output) data. Then, we calculate the accuracy of the model by dividing numbers of correct prediction with the total data. Resulted an accuracy of <b>86.63%</b>

In [11]:
# y_pred = model.predict_classes(x_test, batch_size = 128)

predict_x = model.predict(x_test, batch_size = 128) 
y_pred = np.argmax(predict_x,axis=1)

true = 0
for i, y in enumerate(y_test):
    if y == y_pred[i]:
        true += 1

print('Correct Prediction: {}'.format(true))
print('Wrong Prediction: {}'.format(len(y_pred) - true))
print('Accuracy: {}'.format(true/len(y_pred)*100))

Correct Prediction: 1523
Wrong Prediction: 1111
Accuracy: 57.82080485952923


---

### Load Saved Model

Load saved model and use it to predict a tweet statement's sentiment (positive or negative).

In [14]:
loaded_model = load_model('models/LSTM.h5')

Receives a tweet as an input to be predicted

In [21]:
tweet = str(input('Tweet: '))

Tweet: Pendaftaran Capres untuk #Pilpres2019 saja belum dibuka sampai Agustus 2018...yang normatif begini politikus PDIP masih belum paham? #2019GantiPresiden


The input must be pre processed before it is passed to the model to be predicted

In [22]:
# Pre-process input
regex = re.compile(r'[^a-zA-Z\s]')
tweet = regex.sub('', tweet)
print('Cleaned: ', tweet)

words = tweet.split(' ')
filtered = [w for w in words if w not in indonesian_stopwords]
filtered = ' '.join(filtered)
filtered = [tweet.lower()]

print('Filtered: ', filtered)

Cleaned:  Pendaftaran Capres untuk Pilpres saja belum dibuka sampai Agustus yang normatif begini politikus PDIP masih belum paham GantiPresiden
Filtered:  ['pendaftaran capres untuk pilpres saja belum dibuka sampai agustus yang normatif begini politikus pdip masih belum paham gantipresiden']


Once again, we need to tokenize and encode the words. I use the tokenizer which was previously declared because we want to encode the words based on words that are known by the model.

In [23]:
tokenize_words = token.texts_to_sequences(filtered)
tokenize_words = pad_sequences(tokenize_words, maxlen=get_max_length(), padding='post', truncating='post')
print(tokenize_words)

[[ 6444   591   589   319  1764   827  1788  1237  6436    67 22004  2255]]


This is the result of the prediction which shows the **confidence score** of the tweet statement.

In [24]:
result = loaded_model.predict(tokenize_words)
print(result)

[[0.96300685]]


If the confidence score is close to 0, then the statement is **negative**. On the other hand, if the confidence score is close to 1, then the statement is **positive**. I use a threshold of **0.7** to determine which confidence score is positive and negative, so if it is equal or greater than 0.7, it is **positive** and if it is less than 0.7, it is **negative**

In [25]:
if result >= 0.7:
    print('positive')
else:
    print('negative')

positive


In [26]:
tokenizer_json = token.to_json()
with io.open('tokenizer.json', 'w', encoding='utf-8') as f:
    f.write(json.dumps(tokenizer_json, ensure_ascii=False))