# Sentiment Analysis on Twitter tweets using LSTM and Keras
<hr>

### Steps
<ol type="1">
    <li>Load the dataset (13k twitter tweets with manually marked label)</li>
    <li>Clean Dataset</li>
    <li>Encode Sentiments</li>
    <li>Split Dataset</li>
    <li>Tokenize and Pad/Truncate Tweets</li>
    <li>Build Architecture/Model</li>
    <li>Train and Test</li>
</ol>

<hr>
<i>Import all the libraries needed</i>

In [3]:
!pip install Sastrawi
!pip install tensorflow



In [18]:
import pandas as pd    # to load dataset
import numpy as np     # for mathematic equation
from nltk.corpus import stopwords   # to get collection of stopwords
from sklearn.model_selection import train_test_split       # for splitting dataset
from tensorflow.keras.preprocessing.text import Tokenizer  # to encode text to int
from tensorflow.keras.preprocessing.sequence import pad_sequences   # to do padding or truncating
from tensorflow.keras.models import Sequential     # the model
from tensorflow.keras.layers import Embedding, LSTM, Dense, SpatialDropout1D # layers of the architecture
from tensorflow.keras.callbacks import ModelCheckpoint   # save model
from tensorflow.keras.models import load_model   # load saved model
import re, io, json
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory # Indonesian Stemmer
import tensorflow as tf 
from sklearn.metrics import confusion_matrix,classification_report

<hr>
<i>Preview dataset</i>

In [5]:
data = pd.read_csv('clean_dataset.csv')

print(data[['Tweet', 'HS']])

                                                   Tweet  HS
0      di saat cowok usaha lacak perhati gue kamu lan...   1
1      telat beri tau kamu edan sarap gue gaul cigax ...   0
2      kadang pikir percaya tuhan jatuh kali kali kad...   0
3                                   tau mata sipit lihat   0
4        kaum cebong kafir sudah lihat dongok dungu haha   1
...                                                  ...  ..
13164                   bicara ndasmu congor kate anjing   1
13165                                  kasur enak kunyuk   0
13166                           hati hati bisu bosan duh   0
13167  bom real mudah deteksi bom kubur dahsyat ledak...   0
13168                          situ beri foto kutil onta   1

[13169 rows x 2 columns]


<hr>
<b>Stop Word</b> is a commonly used words in a sentence, usually a search engine is programmed to ignore this words (i.e. "the", "a", "an", "of", etc.)

<i>Declaring the Indonesian stop words</i>

In [1]:
indonesian_stopwords = pd.read_csv('stopwords.txt', sep="\n")
indonesian_stopwords = indonesian_stopwords.iloc[:, 0].values.tolist()
indonesian_stopwords.head(10)

NameError: name 'pd' is not defined

Replace alay words

In [18]:
alay_words = pd.read_csv('alay.csv')
# alay_words = alay_words.set_index("alay")
alay_words

row = alay_words[alay_words.alay == "3x"]
# row.empty

In [19]:
# row
print(str(row['replacement'].values[0]))

tiga kali


<hr>

### Load and Clean Dataset

In the original dataset, the tweets are still dirty. There are still html tags, numbers, uppercase, and punctuations. This will not be good for training, so in <b>load_dataset()</b> function, beside loading the dataset using <b>pandas</b>, I also pre-process the tweets by removing html tags, non alphabet (punctuations and numbers), stop words, and lower case all of the tweets.

### Encode Sentiments
In the same function, I also encode the sentiments into integers (0 and 1). Where 0 is for negative sentiments and 1 is for positive sentiments.

In [6]:
def stemmer(text):
    # Init indonesian stemmer
    factory = StemmerFactory()
    s = factory.create_stemmer()
    result = s.stem(text)
    print(result)
    return result

def remove_stopwords(tweet):
    output = []
    words = tweet.split()
    for word in words:
      if word not in indonesian_stopwords:
        output.append(word)

    return ' '.join(output)

def replace_alay(tweet):
    output = []
    words = tweet.split()
    for word in words:
      row = alay_words[alay_words.alay == word]
      if row.empty:
        output.append(word)
      else:
        output.append(str(row['replacement'].values[0]))

    return ' '.join(output)

    

def load_dataset():
    df = pd.read_csv('utf8_dataset.csv')

    # Remove \n \t \r
    df.replace(to_replace=[r"\\t|\\n|\\r", "\t|\n|\r"], value=[" "," "], regex=True, inplace=True)

    # Tweets/Input
    x_data = df['Tweet']
    
    # Sentiment/Output
    y_data = df['HS']

    # PRE-PROCESS TWEETS
    x_data = x_data.apply(lambda tweet: tweet.lower())

    # Remove HTML tags
    x_data = x_data.replace({'<.*?>': ''}, regex = True)

    # Remove non alphabets
    x_data = x_data.replace({'[^A-Za-z]': ' '}, regex = True)

    # Remove words that is lees than 2 chars
    x_data = x_data.apply(lambda tweet: ' '.join([w for w in tweet.split() if len(w) > 2]))

    # Remove RT
    x_data = x_data.str.replace('rt', '')

    # Remove USER
    x_data = x_data.str.replace('user', '')

    # Remove URL
    x_data = x_data.str.replace('url', '')

    # Remove excess spaces
    x_data = x_data.apply(lambda tweet: ' '.join(tweet.split()))

    # Trim
    x_data = x_data.str.strip()
    
    # Remove stop words
    x_data = x_data.apply(lambda tweet: remove_stopwords(tweet))

    # Replace alay words
    x_data = x_data.apply(lambda tweet: replace_alay(tweet))

    # Stem
    x_data = x_data.apply(lambda tweet: stemmer(tweet))
    
    # ENCODE SENTIMENT -> 0 & 1
    y_data = y_data.replace(1, 1)
    y_data = y_data.replace(0, 0)

    return x_data, y_data

x_data, y_data = load_dataset()

In [7]:
print('Tweet')
print(x_data, '\n')
print('HS')
print(y_data)

Tweet
0        - disaat semua cowok berusaha melacak perhatia...
1        RT USER: USER siapa yang telat ngasih tau elu?...
2        41. Kadang aku berfikir, kenapa aku tetap perc...
3        USER USER AKU ITU AKU  KU TAU MATAMU SIPIT TAP...
4        USER USER Kaum cebong kapir udah keliatan dong...
                               ...                        
13164    USER jangan asal ngomong ndasmu. congor lu yg ...
13165                         USER Kasur mana enak kunyuk'
13166    USER Hati hati bisu :( .g  lagi bosan huft \xf...
13167    USER USER USER USER Bom yang real mudah terdet...
13168    USER Mana situ ngasih(": itu cuma foto ya kuti...
Name: Tweet, Length: 13169, dtype: object 

HS
0        1
1        0
2        0
3        0
4        1
        ..
13164    1
13165    0
13166    0
13167    0
13168    1
Name: HS, Length: 13169, dtype: int64


In [132]:
x_data = x_data.str.replace('uniform resource locator', '')
x_data = x_data.apply(lambda tweet: ' '.join([w for w in tweet.split() if not w.startswith('x')]))
x_data = x_data.apply(lambda tweet: ' '.join([w for w in tweet.split() if len(w) > 2]))
x_data

0        di saat cowok usaha lacak perhati gue kamu lan...
1        telat beri tau kamu edan sarap gue gaul cigax ...
2        kadang pikir percaya tuhan jatuh kali kali kad...
3                                     tau mata sipit lihat
4          kaum cebong kafir sudah lihat dongok dungu haha
                               ...                        
13164                     bicara ndasmu congor kate anjing
13165                                    kasur enak kunyuk
13166                             hati hati bisu bosan duh
13167    bom real mudah deteksi bom kubur dahsyat ledak...
13168                            situ beri foto kutil onta
Name: Tweet, Length: 13169, dtype: object

In [163]:
x_data.to_csv('clean-tweet.csv')
y_data.to_csv('clean-hs.csv')

<hr>

### Split Dataset
In this work, I decided to split the data into 80% of Training and 20% of Testing set using <b>train_test_split</b> method from Scikit-Learn. By using this method, it automatically shuffles the dataset. We need to shuffle the data because in the original dataset, the tweets and sentiments are in order, where they list positive tweets first and then negative tweets. By shuffling the data, it will be distributed equally in the model, so it will be more accurate for predictions.

In [8]:
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size = 0.2)

print('Train Set')
print(x_train, '\n')
print(x_test, '\n')
print('Test Set')
print(y_train, '\n')
print(y_test)

Train Set
1616                  Tidak melanggar undang-undang apapun
11057    1. Siang hari ini, Presiden USER memimpin Rapa...
10297    USER Pusing w Uda 2 x Lihat in anak orang eek ...
9085     Bukan caci maki lagi, 7 juta bahkan 10 juta si...
12051    USER USER USERUSER ABADI AYO DUKUNG TIMNAS IND...
                               ...                        
4091     USER USER USER USER Dah mas USER gitutu orang ...
11459    kedatangan pasangan Hasanah ini untuk meminta ...
6885     USER ; Jd kt ente sampa dikolong tol ada, seja...
13085    ;Kebanyakan fitnah nih mpok silvy #DebatFinalP...
557      Cie tau ma jokowi ga dapet panggung mulai dah ...
Name: Tweet, Length: 10535, dtype: object 

780                                         USER kaya TAI'
6172     USER Padahal buaya ngga ada bulu \xf0\x9f\x98\...
2372     USER USER Hati2 antek Yahudi, Liberal, LGBT, S...
4492          Pertumbuhan Ekonomi Indonesia Kuartal-I 2018
12069                        USER Hahaha payah tai kotok!'
  

<hr>
<i>Function for getting the maximum tweet length, by calculating the mean of all the tweets length (using <b>numpy.mean</b>)</i>

In [9]:
def get_max_length():
    tweet_length = []
    for tweet in x_train:
        tweet_length.append(len(tweet))

    return int(np.ceil(np.mean(tweet_length)))

max_length = get_max_length()
print(max_length)

114


<hr>

### Tokenize and Pad/Truncate Tweets
A Neural Network only accepts numeric data, so we need to encode the tweets. I use <b>tensorflow.keras.preprocessing.text.Tokenizer</b> to encode the tweets into integers, where each unique word is automatically indexed (using <b>fit_on_texts</b> method) based on <b>x_train</b>. <br>
<b>x_train</b> and <b>x_test</b> is converted into integers using <b>texts_to_sequences</b> method.

Each tweets has a different length, so we need to add padding (by adding 0) or truncating the words to the same length (in this case, it is the mean of all tweets length) using <b>tensorflow.keras.preprocessing.sequence.pad_sequences</b>.


<b>post</b>, pad or truncate the words in the back of a sentence<br>
<b>pre</b>, pad or truncate the words in front of a sentence

In [10]:
# ENCODE TWEETS
token = Tokenizer(lower=False)    # no need lower, because already lowered the data in load_data()
token.fit_on_texts(x_train)
x_train = token.texts_to_sequences(x_train)
x_test = token.texts_to_sequences(x_test)

x_train = pad_sequences(x_train, maxlen=max_length, padding='post', truncating='post')
x_test = pad_sequences(x_test, maxlen=max_length, padding='post', truncating='post')

total_words = len(token.word_index) + 1   # add 1 because of 0 padding

print('Encoded X Train\n', x_train, '\n')
print('Encoded X Test\n', x_test, '\n')
print('Maximum tweets length: ', max_length)
print('Total words: ', total_words)

Encoded X Train
 [[ 269 3212 1734 ...    0    0    0]
 [  86 6261  120 ...    0    0    0]
 [   1 6262  756 ...    0    0    0]
 ...
 [   1 1699 1780 ...    0    0    0]
 [4012  539  134 ...    0    0    0]
 [4871   92  757 ...    0    0    0]] 

Encoded X Test
 [[    1   167     0 ...     0     0     0]
 [    1   739   502 ...     0     0     0]
 [    1     1  5405 ...     0     0     0]
 ...
 [    1     1  5121 ...     0     0     0]
 [   13     1     8 ...     0     0     0]
 [    1  2880 18243 ...     0     0     0]] 

Maximum tweets length:  114
Total words:  32659


<hr>

### Build Architecture/Model
<b>Embedding Layer</b>: in simple terms, it creates word vectors of each word in the <i>word_index</i> and group words that are related or have similar meaning by analyzing other words around them.

<b>LSTM Layer</b>: to make a decision to keep or throw away data by considering the current input, previous output, and previous memory. There are some important components in LSTM.
<ul>
    <li><b>Forget Gate</b>, decides information is to be kept or thrown away</li>
    <li><b>Input Gate</b>, updates cell state by passing previous output and current input into sigmoid activation function</li>
    <li><b>Cell State</b>, calculate new cell state, it is multiplied by forget vector (drop value if multiplied by a near 0), add it with the output from input gate to update the cell state value.</li>
    <li><b>Ouput Gate</b>, decides the next hidden state and used for predictions</li>
</ul>

<b>Dense Layer</b>: compute the input with the weight matrix and bias (optional), and using an activation function. I use <b>Sigmoid</b> activation function for this work because the output is only 0 or 1.

The optimizer is <b>Adam</b> and the loss function is <b>Binary Crossentropy</b> because again the output is only 0 and 1, which is a binary number.

In [11]:
# ARCHITECTURE
EMBED_DIM = 32
LSTM_OUT = 64

embed_dim = 128
lstm_out = 196

model = Sequential()
model.add(Embedding(total_words, EMBED_DIM, input_length = max_length))
model.add(LSTM(LSTM_OUT))
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

# model.add(Embedding(2000, embed_dim,input_length = x_train.shape[1]))
# model.add(SpatialDropout1D(0.4))
# model.add(LSTM(lstm_out, dropout=0.2, recurrent_dropout=0.2))
# model.add(Dense(2,activation='softmax'))
# model.compile(loss = 'categorical_crossentropy', optimizer='adam',metrics = ['accuracy'])

print(model.summary())

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 114, 32)           1045088   
                                                                 
 lstm (LSTM)                 (None, 64)                24832     
                                                                 
 dense (Dense)               (None, 1)                 65        
                                                                 
Total params: 1,069,985
Trainable params: 1,069,985
Non-trainable params: 0
_________________________________________________________________
None


<hr>

### Training
For training, it is simple. We only need to fit our <b>x_train</b> (input) and <b>y_train</b> (output/label) data. For this training, I use a mini-batch learning method with a <b>batch_size</b> of <i>128</i> and <i>5</i> <b>epochs</b>.

Also, I added a callback called **checkpoint** to save the model locally for every epoch if its accuracy improved from the previous epoch.

In [12]:
checkpoint = ModelCheckpoint(
    'models/LSTM_2.h5',
    monitor='accuracy',
    save_best_only=True,
    verbose=1
)

In [13]:
model.fit(x_train, y_train, batch_size = 128, epochs = 10, callbacks=[checkpoint])

Epoch 1/10
Epoch 00001: accuracy improved from -inf to 0.57428, saving model to models\LSTM_2.h5
Epoch 2/10
Epoch 00002: accuracy improved from 0.57428 to 0.57636, saving model to models\LSTM_2.h5
Epoch 3/10
Epoch 00003: accuracy did not improve from 0.57636
Epoch 4/10
Epoch 00004: accuracy did not improve from 0.57636
Epoch 5/10
Epoch 00005: accuracy did not improve from 0.57636
Epoch 6/10
Epoch 00006: accuracy did not improve from 0.57636
Epoch 7/10
Epoch 00007: accuracy did not improve from 0.57636
Epoch 8/10
Epoch 00008: accuracy did not improve from 0.57636
Epoch 9/10
Epoch 00009: accuracy did not improve from 0.57636
Epoch 10/10
Epoch 00010: accuracy did not improve from 0.57636


<keras.callbacks.History at 0x2c0b0490220>

<hr>

### Testing
To evaluate the model, we need to predict the sentiment using our <b>x_test</b> data and comparing the predictions with <b>y_test</b> (expected output) data. Then, we calculate the accuracy of the model by dividing numbers of correct prediction with the total data. Resulted an accuracy of <b>86.63%</b>

In [14]:
# y_pred = model.predict_classes(x_test, batch_size = 128)

predict_x = model.predict(x_test, batch_size = 128) 
y_pred = np.argmax(predict_x,axis=1)

true = 0
for i, y in enumerate(y_test):
    if y == y_pred[i]:
        true += 1

print('Correct Prediction: {}'.format(true))
print('Wrong Prediction: {}'.format(len(y_pred) - true))
print('Accuracy: {}'.format(true/len(y_pred)*100))

Correct Prediction: 1536
Wrong Prediction: 1098
Accuracy: 58.31435079726651


In [31]:
data = data.dropna()
X = token.texts_to_sequences(data['Tweet'].values)
X = pad_sequences(X)
Y = pd.get_dummies(data['HS']).values
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.20, random_state = 42)

print(X_train.shape,Y_train.shape)
print(X_test.shape,Y_test.shape)

predict_x = model.predict(X_test) 
classes_x = np.argmax(predict_x,axis=1)

df_test = pd.DataFrame({'true': Y_test.tolist(), 'pred':classes_x})
print(df_test.head())
df_test['true'] = df_test['true'].apply(lambda x: np.argmax(x))
print("confusion matrix",confusion_matrix(df_test.true, df_test.pred))
print(classification_report(df_test.true, df_test.pred))

(10492, 44) (10492, 2)
(2624, 44) (2624, 2)


ValueError: in user code:

    File "C:\ProgramData\Anaconda3\lib\site-packages\keras\engine\training.py", line 1621, in predict_function  *
        return step_function(self, iterator)
    File "C:\ProgramData\Anaconda3\lib\site-packages\keras\engine\training.py", line 1611, in step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "C:\ProgramData\Anaconda3\lib\site-packages\keras\engine\training.py", line 1604, in run_step  **
        outputs = model.predict_step(data)
    File "C:\ProgramData\Anaconda3\lib\site-packages\keras\engine\training.py", line 1572, in predict_step
        return self(x, training=False)
    File "C:\ProgramData\Anaconda3\lib\site-packages\keras\utils\traceback_utils.py", line 67, in error_handler
        raise e.with_traceback(filtered_tb) from None
    File "C:\ProgramData\Anaconda3\lib\site-packages\keras\engine\input_spec.py", line 263, in assert_input_compatibility
        raise ValueError(f'Input {input_index} of layer "{layer_name}" is '

    ValueError: Input 0 of layer "sequential" is incompatible with the layer: expected shape=(None, 114), found shape=(32, 44)


---

### Load Saved Model

Load saved model and use it to predict a tweet statement's sentiment (positive or negative).

In [30]:
loaded_model = load_model('models/LSTM.h5')

Receives a tweet as an input to be predicted

In [31]:
tweet = str(input('Tweet: '))

Tweet: asd


The input must be pre processed before it is passed to the model to be predicted

In [32]:
# Pre-process input
regex = re.compile(r'[^a-zA-Z\s]')
tweet = regex.sub('', tweet)
print('Cleaned: ', tweet)

words = tweet.split(' ')
filtered = [w for w in words if w not in indonesian_stopwords]
filtered = ' '.join(filtered)
filtered = [tweet.lower()]

print('Filtered: ', filtered)

Cleaned:  asd
Filtered:  ['asd']


Once again, we need to tokenize and encode the words. I use the tokenizer which was previously declared because we want to encode the words based on words that are known by the model.

In [33]:
tokenize_words = token.texts_to_sequences(filtered)
tokenize_words = pad_sequences(tokenize_words, maxlen=get_max_length(), padding='post', truncating='post')
print(tokenize_words)

[[0 0 0 0 0 0 0 0 0 0 0 0]]


This is the result of the prediction which shows the **confidence score** of the tweet statement.

In [34]:
result = loaded_model.predict(tokenize_words)
print(result)

[[0.23105481]]


If the confidence score is close to 0, then the statement is **negative**. On the other hand, if the confidence score is close to 1, then the statement is **positive**. I use a threshold of **0.7** to determine which confidence score is positive and negative, so if it is equal or greater than 0.7, it is **positive** and if it is less than 0.7, it is **negative**

In [35]:
if result >= 0.7:
    print('positive')
else:
    print('negative')

negative


In [36]:
tokenizer_json = token.to_json()
with io.open('tokenizer.json', 'w', encoding='utf-8') as f:
    f.write(json.dumps(tokenizer_json, ensure_ascii=False))