# Text Classification
Oleh:
* Muhammad Aqmal Pangestu - 1301180518
* Rizkyta Shainy Angeline - 1301184305

Pada notebook ini akan dilakukan klasifikasi emosi berdasarkan kalimat pada teks. Klasifikasi dilakukan menggunakan library Tensorflow dengan model LSTM.

Dataset yang digunakan: https://www.kaggle.com/ishantjuyal/emotions-in-text


### Import Library

In [25]:
import re
import numpy as np
import pandas as pd
from tqdm import tqdm
import tensorflow as tf
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

### Load & Explore Dataset

In [6]:
df = pd.read_csv('/content/Emotion_final.csv')
df.head()

Unnamed: 0,Text,Emotion
0,i didnt feel humiliated,sadness
1,i can go from feeling so hopeless to so damned...,sadness
2,im grabbing a minute to post i feel greedy wrong,anger
3,i am ever feeling nostalgic about the fireplac...,love
4,i am feeling grouchy,anger


* Jumlah emosi yang terdapat pada data adalah 6.

In [8]:
df.Emotion.value_counts()

happy       7029
sadness     6265
anger       2993
fear        2652
love        1641
surprise     879
Name: Emotion, dtype: int64

* Dataset akan dipisah menjadi 2, train dan test. Train digunakan untuk data latih model LSTM dan test digunakan untuk mengukur akurasi klasifikasi oleh model.
* Sebaran dari data train dan test adalah train=60%, test=40%

In [14]:
df_train, df_test = train_test_split(df, test_size=0.4)

* Dilakukan dummy encoding untuk memecah label `Emotion` sehingga memiliki nilai biner.

In [16]:
category = pd.get_dummies(df_train.Emotion)
df_train_new = pd.concat([df_train, category], axis=1)
df_train_new = df_train_new.drop(columns=['Emotion'])

category = pd.get_dummies(df_test.Emotion)
df_test_new = pd.concat([df_test, category], axis=1)
df_test_new = df_test_new.drop(columns=['Emotion'])
df_test_new

Unnamed: 0,Text,anger,fear,happy,love,sadness,surprise
11533,i could feel that the person was pissed at me ...,1,0,0,0,0,0
616,i feel like it would be too clever and get int...,0,0,1,0,0,0
19089,i could think was i wonder how many days i hav...,0,0,0,0,1,0
14186,i know how that feels hermione said in a surpr...,0,0,0,1,0,0
4089,i feeling so miserable when actually my mum sh...,0,0,0,0,1,0
...,...,...,...,...,...,...,...
17383,this monday i took a math bs test and flunked ...,0,0,0,0,1,0
603,i cant believe the moment where i feel the mos...,0,0,1,0,0,0
14715,i reckon this is fair enough yes the queen is ...,0,0,0,1,0,0
5878,i do give up at times when i feel there s no p...,1,0,0,0,0,0


### Tokenization

Data teks akan dialkukan tokenisasi sehingga data tersebut dapat dimasukkan sebagai input model LSTM.

In [21]:
# Data teks
text_train = df_train_new['Text'].values
text_test = df_test_new['Text'].values
# Label data
label_train = df_train_new[['anger', 'fear', 'happy', 'love', 'sadness', 'surprise']].values
label_test = df_test_new[['anger', 'fear', 'happy', 'love', 'sadness', 'surprise']].values

In [23]:
# Tokenisasi 
tokenizer = Tokenizer(num_words=2000, oov_token='x')
tokenizer.fit_on_texts(text_train) 
tokenizer.fit_on_texts(text_test)
# Texts to sequence
sekuens_train = tokenizer.texts_to_sequences(text_train)
sekuens_test = tokenizer.texts_to_sequences(text_test)
# Padding untuk input model LSTM
padded_train = pad_sequences(sekuens_train) 
padded_test = pad_sequences(sekuens_test)

### LSTM Model
Membuat model LSTM dengan 1 input layer, 1 LSTM layer, dan 1 dense layer. 

* Fungsi aktivasi = `softmax `
* Fungsi loss = `categorial crossentropy`
* Optimizer = `adam`
* Metric = `accuracy`

Model akan dilatih hingga epoch = 15

In [27]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=2000, output_dim=16),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(6, activation='softmax')
])
model.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy'])

* Fungsi callback digunakan jika akurasi melebihi 99%. Jika akurasi melebihi batas tersebut, proses training akan berhenti.

In [31]:
class myCallback(tf.keras.callbacks.Callback):
  def on_epoch_end(self, epoch, logs={}):
    if(logs.get('val_accuracy')>0.99):
      print("\nAccuracy = ",logs.get('val_accuracy') )
      self.model.stop_training = True
callbacks = myCallback()

In [32]:
train = model.fit(padded_train, label_train, epochs=15, batch_size=16,
                    validation_data=(padded_test, label_test), verbose=2, callbacks=[callbacks])

Epoch 1/15
805/805 - 33s - loss: 1.3839 - accuracy: 0.4580 - val_loss: 0.9857 - val_accuracy: 0.6253 - 33s/epoch - 41ms/step
Epoch 2/15
805/805 - 29s - loss: 0.5856 - accuracy: 0.8014 - val_loss: 0.4480 - val_accuracy: 0.8480 - 29s/epoch - 36ms/step
Epoch 3/15
805/805 - 29s - loss: 0.2973 - accuracy: 0.9003 - val_loss: 0.3300 - val_accuracy: 0.8832 - 29s/epoch - 36ms/step
Epoch 4/15
805/805 - 28s - loss: 0.2240 - accuracy: 0.9209 - val_loss: 0.3204 - val_accuracy: 0.8889 - 28s/epoch - 34ms/step
Epoch 5/15
805/805 - 28s - loss: 0.1745 - accuracy: 0.9372 - val_loss: 0.3229 - val_accuracy: 0.8873 - 28s/epoch - 35ms/step
Epoch 6/15
805/805 - 28s - loss: 0.1598 - accuracy: 0.9441 - val_loss: 0.3147 - val_accuracy: 0.8871 - 28s/epoch - 35ms/step
Epoch 7/15
805/805 - 28s - loss: 0.1376 - accuracy: 0.9494 - val_loss: 0.3433 - val_accuracy: 0.8835 - 28s/epoch - 35ms/step
Epoch 8/15
805/805 - 28s - loss: 0.1263 - accuracy: 0.9556 - val_loss: 0.3533 - val_accuracy: 0.8844 - 28s/epoch - 35ms/step


#### Diperoleh akurasi = 97%

### Test with New Text

In [49]:
# Input custom text
sample_text = ('The movie was cool. The animation and the graphics were out of this world. I would recommend this movie.')

In [74]:
# Tokenisasi
tokenizer = Tokenizer(num_words=2000, oov_token='x')
tokenizer.fit_on_texts(sample_text) 
# Texts to sequence
sekuens_sample = tokenizer.texts_to_sequences(sample_text)
# Padding untuk input model LSTM
padded_sample = pad_sequences(sekuens_sample) 

predictions = model.predict(np.array([padded_sample]))
result_anger = '{:.2%}'.format(predictions[0][0])
result_fear = '{:.2%}'.format(predictions[0][1])
result_happy = '{:.2%}'.format(predictions[0][2])
result_love = '{:.2%}'.format(predictions[0][3])
result_sadness = '{:.2%}'.format(predictions[0][4])
result_surprise = '{:.2%}'.format(predictions[0][5])
result = ('| Anger = ' + result_anger + ' | Fear = ' + result_fear + ' | Happy = ' + result_happy
+ ' | Love = ' + result_love + ' | Sadness = ' + result_sadness + ' | Surprise = ' + result_surprise+' |')

In [75]:
result

'| Anger = 5.62% | Fear = 1.06% | Happy = 84.31% | Love = 4.76% | Sadness = 2.85% | Surprise = 1.40% |'

### Conclusion
Pada teks sample, emosi yang diprediksi sesuai dengan yang seharusnya. Akurasi yang didapatkan oleh model LSTM adalah 97% dengan epoch = 15.