# **Data Science Practicum Final Project**

> Sentiment Analysis of COVID-19 Vaccines in Indonesia Using the Naive Bayes Classifier and NLP on Twitter

**By Group 2:**
1. Hazlan Muhammad Qodri (123190080) @hzlnqodrey
2. Elisia Dwi Rahayu (123190062) @elisiadwirahayu
3. Shania Septika Inayasari (123190055) @shaniainayasari

**Project Description:**

This study focuses on public sentiment toward the COVID-19 vaccine. The analysis is based on tweets that include a vaccine hashtag and on Twitter search results for the keyword "vaksin covid 19".

## **1. Scraping Data from Twitter**

In [None]:
%pip install snscrape

In [None]:
import snscrape.modules.twitter as sntwitter
import pandas as pd

#### Query

In [None]:
# Collect Indonesian-language tweets mentioning "covid", posted from January 1st, 2020 up to (but not including) November 1st, 2022
query = "covid since:2020-01-01 until:2022-11-01 lang:id"
limit = 2000  # stop after 2,000 tweets

In [None]:
tweets = []

for tweet in sntwitter.TwitterSearchScraper(query).get_items():
    # stop once the requested number of tweets has been collected
    if len(tweets) == limit:
        break
    tweets.append([
        tweet.date,
        tweet.username,
        tweet.content
    ])

filename = 'tweets_covid_dataset_2k_raw_noindex.csv'
tweets_df = pd.DataFrame(tweets, columns=['Tanggal', 'Username', 'Text'])
tweets_df.to_csv(filename, index=False)
print('Scraping has completed!')
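
The manual counter in the loop above can also be expressed with `itertools.islice`, which stops consuming a generator after a fixed number of items. The sketch below is only an illustration: `fake_tweet_stream` is a hypothetical stand-in for the scraper generator, so the example runs without network access.

```python
from itertools import islice


def fake_tweet_stream():
    """Hypothetical stand-in for a scraper generator: yields (date, username, text) rows forever."""
    n = 0
    while True:
        yield (f"2022-10-31T00:00:{n % 60:02d}", f"user{n}", f"tweet number {n}")
        n += 1


limit = 5

# islice pulls at most `limit` items and then stops, so no manual counter is needed
rows = list(islice(fake_tweet_stream(), limit))

print(len(rows))
print(rows[0])
```

In the notebook, the iterable passed to `islice` would be the generator returned by `get_items()`.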

## **2. Wrangling Data** (Preprocessing)

In [None]:
%pip install tweet-preprocessor
%pip install textblob
%pip install wordcloud
%pip install nltk

In [10]:
import preprocessor as preproc
from textblob import TextBlob
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')
import csv
import string
import pandas as pd
import numpy as np
import re

[nltk_data] Downloading package punkt to C:\Users\HAZLAN M
[nltk_data]     QODRI\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [11]:
# get data from dataset
data = pd.read_csv('https://raw.githubusercontent.com/hzlnqodrey/projek-akhir-prak-ds-sentimen-analisis-twitter-covid19-nlp-bayes/main/data_csv/tweets_covid_dataset_2k_raw_noindex.csv')

In [None]:
data.info()

In [None]:
data.sample(n=5)

##### 1. Case Folding

In [12]:
data['Text'] = data['Text'].str.lower()

In [None]:
print('Case Folding Result : \n')
data.sample(n=5)

In [None]:
data.isnull().sum()

##### 2. Cleaning

In [13]:
# cleaning overall
def preprocessing_data(x):
    return preproc.clean(x)

data['Text'] = data['Text'].apply(preprocessing_data)


In [None]:
print('Cleaning Result : \n')
data.sample(n=5)

In [14]:
# cleaning: escape sequences, non-ASCII characters, mentions, links, hashtags
def remove_comments_special(text):
    # remove literal tab, newline, unicode-escape and backslash sequences, plus periods
    text = text.replace('\\t', " ").replace('\\n', " ").replace(
        '\\u', " ").replace('\\', " ").replace('.', " ")
    # replace non-ASCII characters (emoticons, Chinese characters, etc.) with '?'
    text = text.encode('ascii', 'replace').decode('ascii')
    # remove mentions, hashtags, and links
    text = ' '.join(
        re.sub(r"([@#][A-Za-z0-9]+)|(\w+://\S+)", " ", text).split())
    # remove leftover HTML entity fragments (amp; lt; gt;)
    text = ' '.join(
        re.sub(r"(amp|lt|gt); ", " ", text).split())
    # remove single-letter words
    text = ' '.join(
        re.sub(r"\b[a-zA-Z]\b", " ", text).split())
    return text

data['Text'] = data['Text'].apply(remove_comments_special)

# remove symbols
def remove_symbol(text):
    return re.sub(r'[!@#$%^&*()?,"|:]+', "", text)

data['Text'] = data['Text'].apply(remove_symbol)

# remove punctuation
# (helper defined here but not applied in this run, so hyphens such as in "covid-19" remain in the output)
def remove_punctuation(text):
    return text.translate(str.maketrans("", "", string.punctuation))


print('Cleaning Result : \n')
data.sample(n=5)

Cleaning Result : 



| | Tanggal | Username | Text |
|---|---|---|---|
| 815 | 2022-10-31 11:44:36+00:00 | beritajatimcom | posko covid di sampang masih aktif warga diimb... |
| 36 | 2022-10-31 23:24:47+00:00 | NisakMs | lamalama jadi dr umum ama sp di jkt harus ment... |
| 107 | 2022-10-31 21:36:46+00:00 | SINDOnews | kasus covid-19 kembali meningkat malaysia anju... |
| 110 | 2022-10-31 21:17:31+00:00 | nisailmiaa | kdg mikir andai covid iki gak ono what college... |
| 1249 | 2022-10-31 08:54:34+00:00 | BergemaRock | kaga percaya covad covid dan vaksin |
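
To make the regex step in the cleaning cell easier to follow, here is a small standalone sketch that strips mentions, hashtags, and links from a single string. The helper name `strip_twitter_noise` and the sample tweet are hypothetical. Unlike the notebook's pattern, this version also allows underscores in handles, which is an assumption about what a full handle should match.

```python
import re


def strip_twitter_noise(text: str) -> str:
    """Remove @mentions, #hashtags, and URLs, then collapse repeated whitespace."""
    # mentions and hashtags (underscore allowed here; the notebook's pattern omits it)
    text = re.sub(r"[@#][A-Za-z0-9_]+", " ", text)
    # anything that looks like scheme://rest-of-link
    text = re.sub(r"\w+://\S+", " ", text)
    # split() followed by join() collapses runs of spaces left by the substitutions
    return " ".join(text.split())


# hypothetical sample tweet
sample = "@kemenkes vaksin covid aman #vaksin https://t.co/abc123 ayo"
cleaned = strip_twitter_noise(sample)
print(cleaned)  # vaksin covid aman ayo
```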


##### 3. Tokenizing

In [15]:
def tokenize_data(text):
    return word_tokenize(text)

data['Text_Clean'] = data['Text'].apply(tokenize_data)

In [None]:
print('Tokenizing Result : \n')
data.sample(n=5)

##### 4. Filtering

In [16]:
# Filtering | normalize Indonesian abbreviations (singkatan) with a lookup dictionary
normalizad_word = pd.read_csv(
    "https://raw.githubusercontent.com/meisaputri21/Indonesian-Twitter-Emotion-Dataset/master/kamus_singkatan.csv", sep=";", header=None)
normalizad_word_dict = {}

for index, row in normalizad_word.iterrows():
    if row[0] not in normalizad_word_dict:
        normalizad_word_dict[row[0]] = row[1]


def normalized_term(document):
    return [normalizad_word_dict[term] if term in normalizad_word_dict else term for term in document]


data['Text_Clean'] = data['Text_Clean'].apply(normalized_term)
data.head(5)

| | Tanggal | Username | Text | Text_Clean |
|---|---|---|---|---|
| 0 | 2022-10-31 23:59:58+00:00 | lordkuyang | dulu di barat upn ada warung kayuh bambai bumb... | [dulu, di, barat, upn, ada, warung, kayuh, bam... |
| 1 | 2022-10-31 23:58:49+00:00 | s_h_y_l_l_a | segala sakit yg disalahin vaksin covid lieur | [segala, sakit, yang , disalahin, vaksin, covi... |
| 2 | 2022-10-31 23:58:14+00:00 | kunsh72 | bukankah kamu dulu bagian rezim namun setelah ... | [bukankah, kamu, dulu, bagian, rezim, namun, s... |
| 3 | 2022-10-31 23:57:58+00:00 | erni076 | aku kan ganti lama dulu sebelum covid menyeran... | [saya, kan, ganti, lama, dulu, sebelum, covid,... |
| 4 | 2022-10-31 23:57:38+00:00 | KENTUSIAS | trus masku kena covid lagi kek yah udah endemi... | [terus , masku, kena, covid, lagi, kek, yah, s... |
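
Some normalized tokens in the output above carry a trailing space (`yang `, `terus `), which suggests the dictionary values themselves end in whitespace. A token such as `sudah ` would then not match the stop word `sudah`. The sketch below shows one possible fix: strip whitespace while building the lookup. The three pairs are a hypothetical excerpt chosen to mirror the output, and the tiny stop word set is for illustration only.

```python
# hypothetical excerpt of (abbreviation, expansion) pairs with trailing spaces
raw_pairs = [("yg", "yang "), ("trus", "terus "), ("udah", "sudah ")]

abbrev = {}
for short, full in raw_pairs:
    # keep the first occurrence only, and strip stray whitespace on both sides
    abbrev.setdefault(short.strip(), full.strip())


def normalize(tokens):
    """Replace known abbreviations; leave every other token unchanged."""
    return [abbrev.get(tok, tok) for tok in tokens]


stopwords = {"yang", "sudah"}  # tiny illustrative subset

tokens = normalize(["trus", "vaksin", "yg", "udah", "aman"])
filtered = [tok for tok in tokens if tok not in stopwords]

print(tokens)    # ['terus', 'vaksin', 'yang', 'sudah', 'aman']
print(filtered)  # ['terus', 'vaksin', 'aman']
```

With stripped values, the stop word list no longer needs whitespace variants such as `'yang '`.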


In [17]:
# Filtering | Stop Word
list_stopwords = (['yang', 'untuk', 'pada', 'ke', 'para', 'namun', 'menurut', 'antara', 'seperti', 'jika', 'jika', 'sehingga', 'mungkin', 'kembali', 'dan', 'ini', 'karena', 'oleh', 'saat', 'sekitar', 'bagi', 'serta', 'di', 'dari', 'sebagai', 'hal', 'ketika', 'adalah', 'itu', 'dalam', 'bahwa', 'atau', 'dengan', 'akan', 'juga', 'kalau', 'ada', 'terhadap', 'secara', 'agar', 'lain', 'jadi', 'yang ', 'sudah', 'sudah begitu', 'mengapa', 'kenapa', 'yaitu', 'yakni', 'daripada', 'itulah', 'lagi', 'maka', 'tentang', 'demi', 'dimana', 'kemana', 'pula', 'sambil', 'sebelum', 'sesudah', 'supaya', 'guna', 'kah', 'pun', 'sampai', 'sedangkan', 'selagi',
                  'sementara', 'tetapi', 'apakah', 'sebab', 'selain', 'seolah', 'seraya', 'seterusnya', 'dsb', 'dst', 'dll', 'dahulu', 'dulunya', 'anu', 'demikian', 'tapi', 'juga', 'mari', 'nanti', 'melainkan', 'oh', 'ok', 'sebetulnya', 'setiap', 'sesuatu', 'pasti', 'saja', 'toh', 'ya', 'walau', 'apalagi', 'bagaimanapun', 'yg', 'dg', 'rt', 'dgn', 'ny', 'd', 'klo', 'kalo', 'amp', 'biar', 'bikin', 'bilang', 'krn', 'nya', 'nih', 'sih', 'ah', 'ssh', 'om', 'ah', 'si', 'tau', 'tuh', 'utk', 'ya', 'cek', 'jd', 'aja', 't', 'nyg', 'hehe', 'pen', 'nan', 'loh', 'rt', '&amp', 'yah', 'ni', 'ret', 'za', 'nak', 'haa', 'zaa', 'maa', 'lg', 'eh', 'hmm', 'kali'])

list_stopwords = set(list_stopwords)


def stopwords_removal(words):
    return [word for word in words if word not in list_stopwords]


data['Text_Clean'] = data['Text_Clean'].apply(stopwords_removal)

In [18]:
data.head(5)

| | Tanggal | Username | Text | Text_Clean |
|---|---|---|---|---|
| 0 | 2022-10-31 23:59:58+00:00 | lordkuyang | dulu di barat upn ada warung kayuh bambai bumb... | [dulu, barat, upn, warung, kayuh, bambai, bumb... |
| 1 | 2022-10-31 23:58:49+00:00 | s_h_y_l_l_a | segala sakit yg disalahin vaksin covid lieur | [segala, sakit, disalahin, vaksin, covid, lieur] |
| 2 | 2022-10-31 23:58:14+00:00 | kunsh72 | bukankah kamu dulu bagian rezim namun setelah ... | [bukankah, kamu, dulu, bagian, rezim, setelah,... |
| 3 | 2022-10-31 23:57:58+00:00 | erni076 | aku kan ganti lama dulu sebelum covid menyeran... | [saya, kan, ganti, lama, dulu, covid, menyeran... |
| 4 | 2022-10-31 23:57:38+00:00 | KENTUSIAS | trus masku kena covid lagi kek yah udah endemi... | [terus , masku, kena, covid, kek, sudah , ende... |


In [19]:
def sambungin_kata(text):
    # join the token list back into a single space-separated string
    return ' '.join(text)

data['Text_Clean_Sambung'] = data['Text_Clean'].apply(sambungin_kata)

In [20]:
data

| | Tanggal | Username | Text | Text_Clean | Text_Clean_Sambung |
|---|---|---|---|---|---|
| 0 | 2022-10-31 23:59:58+00:00 | lordkuyang | dulu di barat upn ada warung kayuh bambai bumb... | [dulu, barat, upn, warung, kayuh, bambai, bumb... | dulu barat upn warung kayuh bambai bumbu haban... |
| 1 | 2022-10-31 23:58:49+00:00 | s_h_y_l_l_a | segala sakit yg disalahin vaksin covid lieur | [segala, sakit, disalahin, vaksin, covid, lieur] | segala sakit disalahin vaksin covid lieur |
| 2 | 2022-10-31 23:58:14+00:00 | kunsh72 | bukankah kamu dulu bagian rezim namun setelah ... | [bukankah, kamu, dulu, bagian, rezim, setelah,... | bukankah kamu dulu bagian rezim setelah terdep... |
| 3 | 2022-10-31 23:57:58+00:00 | erni076 | aku kan ganti lama dulu sebelum covid menyeran... | [saya, kan, ganti, lama, dulu, covid, menyeran... | saya kan ganti lama dulu covid menyerang sudah... |
| 4 | 2022-10-31 23:57:38+00:00 | KENTUSIAS | trus masku kena covid lagi kek yah udah endemi... | [terus , masku, kena, covid, kek, sudah , ende... | terus masku kena covid kek sudah endemi masi ae |
| ... | ... | ... | ... | ... | ... |
| 1995 | 2022-10-31 04:14:37+00:00 | Rakhalandikas | iya aku lebih merujuk kepada kerumunan faktor ... | [iya, saya, lebih, merujuk, kepada, kerumunan,... | iya saya lebih merujuk kepada kerumunan faktor... |
| 1996 | 2022-10-31 04:14:23+00:00 | DumaiBarat | ayo vaksinagar tubuh terlindungi dari covid-19... | [ayo, vaksinagar, tubuh, terlindungi, covid-19... | ayo vaksinagar tubuh terlindungi covid-19 ayo ... |
| 1997 | 2022-10-31 04:14:23+00:00 | TyasAru14236593 | bahkan kalo mau liat di lingkungan sekitar jug... | [bahkan, kalau , mau, liat, lingkungan, banyak... | bahkan kalau mau liat lingkungan banyak tetan... |
| 1998 | 2022-10-31 04:13:44+00:00 | rezkyheidi | min ada info lokasi tersedia untuk booster vak... | [min, info, lokasi, tersedia, booster, vaksin,... | min info lokasi tersedia booster vaksin covid |


In [21]:
data.drop(['Text_Clean'], axis=1, inplace=True)

In [22]:
data

| | Tanggal | Username | Text | Text_Clean_Sambung |
|---|---|---|---|---|
| 0 | 2022-10-31 23:59:58+00:00 | lordkuyang | dulu di barat upn ada warung kayuh bambai bumb... | dulu barat upn warung kayuh bambai bumbu haban... |
| 1 | 2022-10-31 23:58:49+00:00 | s_h_y_l_l_a | segala sakit yg disalahin vaksin covid lieur | segala sakit disalahin vaksin covid lieur |
| 2 | 2022-10-31 23:58:14+00:00 | kunsh72 | bukankah kamu dulu bagian rezim namun setelah ... | bukankah kamu dulu bagian rezim setelah terdep... |
| 3 | 2022-10-31 23:57:58+00:00 | erni076 | aku kan ganti lama dulu sebelum covid menyeran... | saya kan ganti lama dulu covid menyerang sudah... |
| 4 | 2022-10-31 23:57:38+00:00 | KENTUSIAS | trus masku kena covid lagi kek yah udah endemi... | terus masku kena covid kek sudah endemi masi ae |
| ... | ... | ... | ... | ... |
| 1995 | 2022-10-31 04:14:37+00:00 | Rakhalandikas | iya aku lebih merujuk kepada kerumunan faktor ... | iya saya lebih merujuk kepada kerumunan faktor... |
| 1996 | 2022-10-31 04:14:23+00:00 | DumaiBarat | ayo vaksinagar tubuh terlindungi dari covid-19... | ayo vaksinagar tubuh terlindungi covid-19 ayo ... |
| 1997 | 2022-10-31 04:14:23+00:00 | TyasAru14236593 | bahkan kalo mau liat di lingkungan sekitar jug... | bahkan kalau mau liat lingkungan banyak tetan... |
| 1998 | 2022-10-31 04:13:44+00:00 | rezkyheidi | min ada info lokasi tersedia untuk booster vak... | min info lokasi tersedia booster vaksin covid |


## **2.1 Translating the Preprocessed Clean Tweets into English** (Translating)

In [None]:
%pip install googletrans==4.0.0rc1

In [None]:
%pip install google-cloud-translate==2.0.1
%pip install --upgrade google-cloud-translate

In [None]:
## NOTE: use this cell if translating through the Google Translate web API (googletrans)
import pandas as pd
from googletrans import Translator

translator = Translator()

data = data.reset_index()  # make sure indexes pair with number of rows

# translate each cleaned tweet into English, one request per row
translated_word = []
for per_row in data['Text_Clean_Sambung']:
    out = translator.translate(per_row, dest='en')
    translated_word.append(out.text)

data.insert(loc=len(data.columns),
            column="text_english", value=translated_word)

print('Translating has completed!')

In [None]:
## NOTE: use this cell if translating through the Google Cloud Translation API
import os

from google.cloud import translate_v2

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = r"serviceKey.json"

translate_client = translate_v2.Client()

text = "Saya siapa dan kamu dimana"

target = "en"

output = translate_client.translate(text, target_language=target)

print(output)
print(output['translatedText'])

In [23]:
# REAL IMPLEMENTATION
import os

from google.cloud import translate_v2

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = r"serviceKey.json"

translate_client = translate_v2.Client()
target = "en"

# make sure indexes pair with number of rows
# (reset_index() also keeps the old index as a new 'index' column)
data = data.reset_index()

# translate each cleaned tweet into English, one request per row
translated_word = []
for per_row in data['Text_Clean_Sambung']:
    output = translate_client.translate(per_row, target_language=target)
    translated_word.append(output['translatedText'])

data.insert(loc=len(data.columns),
            column="text_english", value=translated_word)

print('Translating has completed!')

Translating has completed!
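
Sending one request per tweet is slow for 2,000 rows. A possible alternative is to send the texts in batches. The sketch below shows only the batching logic: `translate_in_batches`, `chunked`, the batch size of 100, and the `fake_translate` stand-in are all hypothetical, and the stand-in simply upper-cases its input so the example runs offline.

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def translate_in_batches(texts, translate_batch, size=100):
    """Apply `translate_batch` (list in, list out) to `texts` one batch at a time."""
    translated = []
    for batch in chunked(texts, size):
        translated.extend(translate_batch(batch))
    return translated


def fake_translate(batch):
    # hypothetical stand-in for a real translation call; it just upper-cases the text
    return [text.upper() for text in batch]


result = translate_in_batches(["satu", "dua", "tiga"], fake_translate, size=2)
print(result)  # ['SATU', 'DUA', 'TIGA']
```

With the Cloud client, `translate_batch` would wrap a call that passes the whole list to `translate_client.translate` and collects each `translatedText`. That assumes the client accepts a list of strings and returns one result per string, which should be checked against the library documentation, along with any per-request size limit.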


In [24]:
data.head(5)

| | index | Tanggal | Username | Text | Text_Clean_Sambung | text_english |
|---|---|---|---|---|---|---|
| 0 | 0 | 2022-10-31 23:59:58+00:00 | lordkuyang | dulu di barat upn ada warung kayuh bambai bumb... | dulu barat upn warung kayuh bambai bumbu haban... | In the past, the western UPN stall, the caulif... |
| 1 | 1 | 2022-10-31 23:58:49+00:00 | s_h_y_l_l_a | segala sakit yg disalahin vaksin covid lieur | segala sakit disalahin vaksin covid lieur | all illnesses are blamed for the covid lieur v... |
| 2 | 2 | 2022-10-31 23:58:14+00:00 | kunsh72 | bukankah kamu dulu bagian rezim namun setelah ... | bukankah kamu dulu bagian rezim setelah terdep... | `Didn&#39;t you used to be part of the regime a...` |
| 3 | 3 | 2022-10-31 23:57:58+00:00 | erni076 | aku kan ganti lama dulu sebelum covid menyeran... | saya kan ganti lama dulu covid menyerang sudah... | I changed it a long time ago, when Covid attac... |
| 4 | 4 | 2022-10-31 23:57:38+00:00 | KENTUSIAS | trus masku kena covid lagi kek yah udah endemi... | terus masku kena covid kek sudah endemi masi ae | `then my mask got covid kek it&#39;s already en...` |
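
The `text_english` column above contains HTML entities such as `&#39;` where an apostrophe is expected. One way to clean this up before further processing is Python's standard `html.unescape`. The sketch below uses shortened prefixes of two of the translated rows as sample input; the variable names are hypothetical.

```python
import html

# shortened prefixes of two translated rows, still containing HTML entities
raw_translations = [
    "Didn&#39;t you used to be part of the regime",
    "then my mask got covid kek it&#39;s already",
]

# html.unescape converts named and numeric character references back to characters
clean_translations = [html.unescape(text) for text in raw_translations]

for text in clean_translations:
    print(text)
```

On the DataFrame, the same function could be applied with `data['text_english'].apply(html.unescape)`.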


In [None]:
# save the translated dataset to CSV
filename = "english_tweets_covid_dataset_2k.csv"
data.to_csv(filename, index=False)

## **2.2 Stemming**

## **3. Modelling Data** (Modeling and Training)

## **4. Data Classification with Naive Bayes** (Classification)
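
This section has no code yet. As a rough illustration of the technique named in the title, here is a minimal multinomial Naive Bayes sketch with add-one (Laplace) smoothing, written in plain Python. It is not the project's implementation: the function names, the four training sentences, and their labels are hypothetical toy data, and a real run would use the translated tweets with sentiment labels.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(model, tokens):
    """Return the class with the highest log posterior for `tokens`."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok not in vocab:
                continue  # skip words never seen during training
            # add-one smoothing avoids zero probabilities for unseen class/word pairs
            score += math.log((word_counts[label][tok] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# hypothetical toy training data
train_docs = [
    "vaccine safe and effective".split(),
    "grateful for the booster".split(),
    "vaccine side effects are scary".split(),
    "do not trust the vaccine".split(),
]
train_labels = ["positive", "positive", "negative", "negative"]

model = train_nb(train_docs, train_labels)
prediction = predict_nb(model, "the booster is safe".split())
print(prediction)  # positive
```

Working in log space keeps the product of many small probabilities from underflowing, which matters once documents are longer than a few words.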