# 1. Introduction

- Dataset: `cyberbullying_tweets`

This dataset contains tweets from individual accounts together with the type of each tweet. Tweets are labeled as *religion*, *age*, *gender*, *ethnicity*, or *not cyberbullying*; the label mapping used in the prediction step also includes an `other_cyberbullying` class.

- Objective:
    
  Run inference with the improved LSTM model to classify new tweets.

# 2. Import Library

Next, I import the required libraries.

In [1]:
# Library Load Model
import pandas as pd
import numpy as np
from tensorflow.keras.models import load_model

# Library Pre-Processing
from nltk.stem import WordNetLemmatizer
import nltk
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# 3. Load the Model File

Load the saved model file with the `load_model` function from TensorFlow.

In [2]:
# Load Model
  
model_lstm = load_model('best_model')

# 4. Creating the Text Pre-Processing Function

Next, I create the text pre-processing function.

In [3]:
# Additional Stopwords
additional_stopwords = ['rt', 'mkr', 'didn', 'bc', 'n', 'm', 
                  'im', 'll', 'y', 've', 'u', 'ur', 'don', 
                  'p', 't', 's', 'aren', 'kp', 'o', 'kat',
                  'de', 're', 'amp', 'will', 'wa', 'e', 'like', 'andre', 'na', 're', 'lil', 'd', 'na', 'pete', 'annie', 'nikki', 'lmao', 'miley', 'wan', 'gon']

In [4]:
# Build the English stopword set, extended with the additional stopwords
stpwds_eng = set(stopwords.words('english')) | set(additional_stopwords)
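To make the effect of the extended stopword collection concrete, here is a minimal sketch that runs without NLTK. The `base_stopwords` list is a hypothetical stand-in for `stopwords.words('english')` (the real list is much longer), and `drop_stopwords` is an illustrative helper that is not part of the notebook.

```python
# Hypothetical stand-in for stopwords.words('english'); the real NLTK list is much longer
base_stopwords = ['i', 'to', 'with', 'the', 'a']

# A few entries from additional_stopwords, including a duplicate ('re')
extra_stopwords = ['rt', 'amp', 're', 'like', 're']

# A set removes the duplicate and gives fast membership tests
all_stopwords = set(base_stopwords) | set(extra_stopwords)

def drop_stopwords(tokens, stopword_set):
    """Keep only the tokens that are not in the stopword set."""
    return [tok for tok in tokens if tok not in stopword_set]

tokens = ['rt', 'want', 'samyang', 'with', 'spicy', 'sauce', 'amp', 'level']
print(drop_stopwords(tokens, all_stopwords))
# ['want', 'samyang', 'spicy', 'sauce', 'level']
```

Tweet-specific tokens such as `rt` and `amp` are removed together with the ordinary English stopwords, which is the purpose of the additional list.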

In [5]:
# Text pre-processing function

cleaning_pattern = r"@\S+|https?:\S+|http?:\S|[^A-Za-z0-9]+"
lemmatizer = WordNetLemmatizer()

def text_proses(teks):

    # Convert the text to lowercase
    teks = teks.lower()

    # Remove mentions, links, and any other non-alphanumeric characters
    teks = re.sub(cleaning_pattern, ' ', teks)

    # Remove mentions
    teks = re.sub(r"@[A-Za-z0-9_]+", " ", teks)
  
    # Remove hashtags
    teks = re.sub(r"#[A-Za-z0-9_]+", " ", teks)

    # Remove literal "\n" sequences
    teks = re.sub(r"\\n", " ", teks)

    # Remove words of 1 to 3 characters
    teks = re.sub(r'\b\w{1,3}\b', " ", teks)
  
    # Strip leading and trailing whitespace
    teks = teks.strip()

    # Remove anything that is not a letter, whitespace, or apostrophe (emoji, gamma, etc.)
    teks = re.sub(r"[^A-Za-z\s']", " ", teks)

    # Collapse repeated whitespace
    teks = re.sub(r"\s\s+", " ", teks)
        
    # Tokenize
    tokens = word_tokenize(teks)

    # Remove stopwords
    teks = ' '.join([word for word in tokens if word not in stpwds_eng])

    # Lemmatize (note: applied to the joined string as a whole, not to each token)
    teks = lemmatizer.lemmatize(teks)

    return teks
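To see what the regular-expression steps do on their own, here is a minimal sketch that repeats only the `re.sub` calls of `text_proses`, in the same order, without tokenization, stopword removal, or lemmatization. The function name `regex_stage` and the sample tweet are made up for illustration.

```python
import re

# Pattern copied from text_proses, written as a raw string
CLEANING_PATTERN = r"@\S+|https?:\S+|http?:\S|[^A-Za-z0-9]+"

def regex_stage(text):
    """Regex-only part of the pre-processing, in the same order as text_proses."""
    text = text.lower()
    text = re.sub(CLEANING_PATTERN, " ", text)    # mentions, links, non-alphanumerics
    text = re.sub(r"@[A-Za-z0-9_]+", " ", text)   # mentions
    text = re.sub(r"#[A-Za-z0-9_]+", " ", text)   # hashtags
    text = re.sub(r"\\n", " ", text)              # literal "\n" sequences
    text = re.sub(r"\b\w{1,3}\b", " ", text)      # words of 1 to 3 characters
    text = text.strip()
    text = re.sub(r"[^A-Za-z\s']", " ", text)     # not a letter, whitespace, or apostrophe
    text = re.sub(r"\s\s+", " ", text)            # repeated whitespace
    return text

# Made-up sample tweet
sample = "@user I want to eat samyang with spicy sauce level 5 https://t.co/abc #food"
print(regex_stage(sample))
# want samyang with spicy sauce level food
```

Two things are visible in the output. The word `with` is still present because stopwords are only removed after tokenization. The hashtag text `food` also survives: the first substitution already replaces the `#` character, so the later hashtag pattern has nothing left to match.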

# 5. Creating the Inference Data

I create one inference record and convert it to a DataFrame with pandas.

In [6]:
# Create New Data 

data_inf = {
    'tweet_text' : 'I want to eat samyang with spicy sauce level 5'                                
}

data_inf = pd.DataFrame([data_inf])
data_inf

| | tweet_text |
|---|---|
| 0 | I want to eat samyang with spicy sauce level 5 |


# 6. Pre-Processing the Inference Data

Next, I apply the pre-processing function to the inference data.

In [7]:
# Pre-process the inference data
data_inf['tweet_processed'] = data_inf['tweet_text'].apply(text_proses)

In [8]:
# Check that the pre-processing worked
data_inf

| | tweet_text | tweet_processed |
|---|---|---|
| 0 | I want to eat samyang with spicy sauce level 5 | want samyang spicy sauce level |


# 7. Predicting the Tweet Type

Finally, I predict the type of the tweet.

In [9]:
# Predict the tweet type

y_inf_pred = np.argmax(model_lstm.predict(data_inf['tweet_processed']), axis=-1)

# Map the predicted class index to its label
labels = ['age', 'ethnicity', 'gender', 'not_cyberbullying',
          'other_cyberbullying', 'religion']
result = labels[y_inf_pred[0]]

# Print the result
print(result)

not_cyberbullying


Based on the output above, the tweet is classified as `not_cyberbullying`.
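The step from model output to label can also be sketched in plain Python, which makes it easy to report the winning probability next to the label. This assumes the model returns one probability per class, in the class order used in the prediction cell. The helper `decode_prediction` and the numbers in `fake_probs` are invented for illustration and are not real model output.

```python
# Class order taken from the prediction cell above
LABELS = ['age', 'ethnicity', 'gender', 'not_cyberbullying',
          'other_cyberbullying', 'religion']

def decode_prediction(probabilities, labels=LABELS):
    """Return (label, probability) for the highest-scoring class."""
    if len(probabilities) != len(labels):
        raise ValueError("expected one probability per label")
    # Index of the largest probability, the same idea as np.argmax for one row
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return labels[best], probabilities[best]

# Made-up probabilities for a single tweet (assumption, not real model output)
fake_probs = [0.05, 0.05, 0.10, 0.60, 0.15, 0.05]
label, prob = decode_prediction(fake_probs)
print(label, prob)
# not_cyberbullying 0.6
```

Returning the probability together with the label helps when a prediction is borderline, and the length check fails loudly if the label list and the model output ever disagree on the number of classes.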