# **Data Science Practicum Final Project**

> Sentiment Analysis of COVID-19 Vaccines in Indonesia Using the Naive Bayes Classifier and NLP on Twitter

**By Group 2:**
1. Hazlan Muhammad Qodri (123190080) @hzlnqodrey
2. Elisia Dwi Rahayu (123190062) @elisiadwirahayu
3. Shania Septika Inayasari (123190055) @shaniainayasari

**Project Description:**

This study focuses on public sentiment toward COVID-19 vaccines. The analysis is based on tweets that include a vaccine hashtag and on Twitter searches for the keyword `vaksin covid 19`.

## **1. Scraping Data from Twitter**

In [None]:
%pip install snscrape

In [None]:
import pandas as pd
import numpy as np
import snscrape.modules.twitter as sntwitter
import re


#### Query

In [None]:
# Get Indonesian-language tweets mentioning "covid" from January 1st, 2020 until November 1st, 2022
query = "covid since:2020-01-01 until:2022-11-01 lang:id"
limit = 2000  # stop after 2,000 tweets

In [None]:
tweets = []

for tweet in sntwitter.TwitterSearchScraper(query).get_items():
    if len(tweets) == limit:
        break
    else:
        tweets.append([
            tweet.date,
            tweet.username,
            tweet.content
        ])

filename = 'tweets_covid_dataset_2k_raw_noindex.csv'
tweets_df = pd.DataFrame(tweets, columns=['Tanggal', 'Username', 'Text'])
tweets_df.to_csv(filename, index=False)
print('Scraping has completed!')

## **2. Wrangling Data** (Preprocessing)

In [None]:
%pip install tweet-preprocessor
%pip install textblob
%pip install wordcloud
%pip install nltk

In [None]:
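import nltk
import preprocessor as preproc
from textblob import TextBlob
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
import csv
import string

# word_tokenize needs the Punkt tokenizer data; newer NLTK releases may ask for 'punkt_tab' instead
nltk.download('punkt')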

In [347]:
# get data from dataset
data = pd.read_csv('https://raw.githubusercontent.com/hzlnqodrey/projek-akhir-prak-ds-sentimen-analisis-twitter-covid19-nlp-bayes/main/data_csv/tweets_covid_dataset_2k_raw_noindex.csv')

In [None]:
data.info()

In [None]:
data.sample(n=5)

##### 1. Case Folding

In [348]:
data['Text'] = data['Text'].str.lower()

In [None]:
print('Case Folding Result : \n')
data.sample(n=5)

In [None]:
data.isnull().sum()

##### 2. Cleaning

In [349]:
# cleaning overall
def preprocessing_data(x):
    return preproc.clean(x)

data['Text'] = data['Text'].apply(preprocessing_data)


In [None]:
print('Cleaning Result : \n')
data.sample(n=5)

In [350]:
# cleaning: escape sequences, non-ASCII characters, mentions, links, hashtags
def remove_comments_special(text):
    # remove literal \t, \n and \u escape sequences, backslashes, and periods
    text = text.replace('\\t', " ").replace('\\n', " ").replace(
        '\\u', " ").replace('\\', " ").replace('.', " ")
    # replace non-ASCII characters (emoji, Chinese characters, etc.) with '?'
    text = text.encode('ascii', 'replace').decode('ascii')
    # remove mentions, links, and hashtags
    text = ' '.join(
        re.sub(r"([@#][A-Za-z0-9]+)|(\w+:\/\/\S+)", " ", text).split())
    # remove leftovers of HTML entities (&amp; &lt; &gt;)
    text = ' '.join(
        re.sub("amp; ", " ", text).split())
    text = ' '.join(
        re.sub("lt; ", " ", text).split())
    text = ' '.join(
        re.sub("gt; ", " ", text).split())
    # remove single characters
    text = ' '.join(
        re.sub(r"\b[a-zA-Z]\b", " ", text).split())
    return text

data['Text'] = data['Text'].apply(remove_comments_special)

# remove symbols
def remove_symbol(text):
    return re.sub(r"[\!\@\#\$\%\^\&\*\(\)\?\,\"\|\:]+", "", text)

data['Text'] = data['Text'].apply(remove_symbol)

# remove punctuation (defined but not applied here, so hyphens such as in 'covid-19' are kept)
def remove_punctuation(text):
    return text.translate(str.maketrans("", "", string.punctuation))


print('Cleaning Result : \n')
data.sample(n=5)

Cleaning Result : 

| | Tanggal | Username | Text |
|---|---|---|---|
| 1416 | 2022-10-31 07:44:53+00:00 | beritacovid | indonesia cabut status gawat darurat covid-19 ini penjelasan pemerintah - msn |
| 469 | 2022-10-31 14:32:35+00:00 | kangebadut | oh mungkin berkurangnya krn bnyk yg hilang dri muka bumi krn covid |
| 1485 | 2022-10-31 07:25:16+00:00 | abuhafiz1 | jangan la ulangi kesilapan dulu ambil alih lebuhraya ni mahalbeban kewangan negara kita banyak hutang lepas covid-19 janganlah selalu populis panjangkan konsesi dan rendahkan harga tol mungkin alternatif |
| 994 | 2022-10-31 10:32:37+00:00 | dindaasyafiraa | plis jangan covid lagi plis plis sedih banget nanti sia sia persiapan ini |
| 998 | 2022-10-31 10:31:03+00:00 | radiosushifm | sheila on kembali hibur sheilagank lg ni sushimitra setelah kabar hiatu sejak pandemi covid tpi netizen malah fokus ke om duta nya yg tak menua hmmm gimana mnurut kmu sushimitra xixi |
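
To make the effect of the regex-based cleaning easier to follow, the sketch below applies the same mention/hashtag/URL pattern and single-character pattern to one string. The sample tweet and the helper name `strip_mentions_links` are illustrative assumptions; neither comes from the dataset or the notebook.

```python
import re

# Same two patterns used in remove_comments_special.
MENTION_HASHTAG_URL = re.compile(r"([@#][A-Za-z0-9]+)|(\w+:\/\/\S+)")
SINGLE_CHAR = re.compile(r"\b[a-zA-Z]\b")


def strip_mentions_links(text: str) -> str:
    """Drop mentions, hashtags, URLs and single letters, then collapse spaces."""
    text = MENTION_HASHTAG_URL.sub(" ", text)
    text = SINGLE_CHAR.sub(" ", text)
    return " ".join(text.split())


# Hypothetical tweet, already lower-cased by the case-folding step.
sample = "@kemenkes vaksin covid sudah tersedia d puskesmas #vaksin https://t.co/abc123"
cleaned = strip_mentions_links(sample)
print(cleaned)  # vaksin covid sudah tersedia puskesmas
```

Splitting on whitespace and re-joining after each substitution is what keeps the result free of double spaces where a mention or link was removed.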


In [None]:
data.head(50)

##### 3. Tokenizing

In [351]:
def tokenize_data(text):
    return word_tokenize(text)

data['Text_Clean'] = data['Text'].apply(tokenize_data)

In [352]:
print('Tokenizing Result : \n')
data.sample(n=5)

Tokenizing Result : 

| | Tanggal | Username | Text | Text_Clean |
|---|---|---|---|---|
| 1685 | 2022-10-31 06:08:16+00:00 | KuretaID | cerita nabilah ayu raih hidayah kenakan hijab di tengah pandemi covid-19 | [cerita, nabilah, ayu, raih, hidayah, kenakan, hijab, di, tengah, pandemi, covid-19] |
| 1694 | 2022-10-31 06:04:32+00:00 | Sugirama | covid-19 terkendali djbc ungkap nasib insentif fiskal vaksin dan alkes | [covid-19, terkendali, djbc, ungkap, nasib, insentif, fiskal, vaksin, dan, alkes] |
| 1429 | 2022-10-31 07:43:07+00:00 | Linda_Christi | awas efek samping covid-19 padawanita | [awas, efek, samping, covid-19, padawanita] |
| 190 | 2022-10-31 17:38:24+00:00 | Acidicus | bwahahahahahahahahno | [bwahahahahahahahahno] |
| 471 | 2022-10-31 14:32:10+00:00 | CariTahuPasti | belakangan ini banyak kejadian yg memakan korban hingga ratusan org mulai dr kanjuruhan korea selatan india lainnya salah satu faktornya mungkin orgg2 telah lelah dgn pandemi yg sdh mau thn sehingga byk berkumpultetap hati-hati yaingat jaga prokes covid masih ada | [belakangan, ini, banyak, kejadian, yg, memakan, korban, hingga, ratusan, org, mulai, dr, kanjuruhan, korea, selatan, india, lainnya, salah, satu, faktornya, mungkin, orgg2, telah, lelah, dgn, pandemi, yg, sdh, mau, thn, sehingga, byk, berkumpultetap, hati-hati, yaingat, jaga, prokes, covid, masih, ada] |


##### 4. Filtering

In [None]:
# Filtering | Indonesian abbreviations (expand each abbreviation to its full form)
normalized_word = pd.read_csv(
    "https://raw.githubusercontent.com/meisaputri21/Indonesian-Twitter-Emotion-Dataset/master/kamus_singkatan.csv", sep=";", header=None)
normalized_word_dict = {}

# column 0 holds the abbreviation, column 1 its full form; keep the first mapping seen
for index, row in normalized_word.iterrows():
    if row[0] not in normalized_word_dict:
        normalized_word_dict[row[0]] = row[1]


def normalized_term(document):
    return [normalized_word_dict[term] if term in normalized_word_dict else term for term in document]


data['Text_Clean'] = data['Text_Clean'].apply(normalized_term)
data.head(5)

In [None]:
# Filtering | Stop Word
list_stopwords = (['yang', 'untuk', 'pada', 'ke', 'para', 'namun', 'menurut', 'antara', 'seperti', 'jika', 'jika', 'sehingga', 'mungkin', 'kembali', 'dan', 'ini', 'karena', 'oleh', 'saat', 'sekitar', 'bagi', 'serta', 'di', 'dari', 'sebagai', 'hal', 'ketika', 'adalah', 'itu', 'dalam', 'bahwa', 'atau', 'dengan', 'akan', 'juga', 'kalau', 'ada', 'terhadap', 'secara', 'agar', 'lain', 'jadi', 'yang ', 'sudah', 'sudah begitu', 'mengapa', 'kenapa', 'yaitu', 'yakni', 'daripada', 'itulah', 'lagi', 'maka', 'tentang', 'demi', 'dimana', 'kemana', 'pula', 'sambil', 'sebelum', 'sesudah', 'supaya', 'guna', 'kah', 'pun', 'sampai', 'sedangkan', 'selagi',
                  'sementara', 'tetapi', 'apakah', 'sebab', 'selain', 'seolah', 'seraya', 'seterusnya', 'dsb', 'dst', 'dll', 'dahulu', 'dulunya', 'anu', 'demikian', 'tapi', 'juga', 'mari', 'nanti', 'melainkan', 'oh', 'ok', 'sebetulnya', 'setiap', 'sesuatu', 'pasti', 'saja', 'toh', 'ya', 'walau', 'apalagi', 'bagaimanapun', 'yg', 'dg', 'rt', 'dgn', 'ny', 'd', 'klo', 'kalo', 'amp', 'biar', 'bikin', 'bilang', 'krn', 'nya', 'nih', 'sih', 'ah', 'ssh', 'om', 'ah', 'si', 'tau', 'tuh', 'utk', 'ya', 'cek', 'jd', 'aja', 't', 'nyg', 'hehe', 'pen', 'nan', 'loh', 'rt', '&amp', 'yah', 'ni', 'ret', 'za', 'nak', 'haa', 'zaa', 'maa', 'lg', 'eh', 'hmm', 'kali'])

list_stopwords = set(list_stopwords)


def stopwords_removal(words):
    return [word for word in words if word not in list_stopwords]


data['Text_Clean'] = data['Text_Clean'].apply(stopwords_removal)
data.head(5)

In [None]:
data.head(5)
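
The two filtering steps can be traced on a single token list. The sketch below is a minimal illustration that assumes a tiny abbreviation dictionary and stop-word set; the entries in `abbreviations` and `stopwords` are hypothetical stand-ins for `kamus_singkatan.csv` and `list_stopwords`, not copies of them.

```python
# Hypothetical miniature versions of the abbreviation dictionary and stop-word set.
abbreviations = {"yg": "yang", "dgn": "dengan", "krn": "karena", "bnyk": "banyak"}
stopwords = {"yang", "dengan", "karena", "di", "dan"}


def normalize(tokens):
    """Replace each known abbreviation with its full form."""
    return [abbreviations.get(token, token) for token in tokens]


def remove_stopwords(tokens):
    """Keep only tokens that are not in the stop-word set."""
    return [token for token in tokens if token not in stopwords]


tokens = ["vaksin", "yg", "bnyk", "dicari", "krn", "covid"]
filtered = remove_stopwords(normalize(tokens))
print(filtered)  # ['vaksin', 'banyak', 'dicari', 'covid']
```

Normalizing before removing stop words matters here: `yg` and `krn` are first expanded to `yang` and `karena`, which the stop-word set can then drop even if the abbreviated forms were not listed.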

## **2.1 Translating the Preprocessed Tweets into English** (Translating)

In [None]:
%pip install googletrans==4.0.0rc1

In [355]:
from googletrans import Translator

translate = Translator()

out = translate.translate('veritas lux', dest='en')

print(out.text)

the truth of light


## **2.2 Stemming**

## **3. Modelling Data** (Modeling and Training)

## **4. Data Classification with Naive Bayes** (Classification)
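
The notebook does not show the classification step yet. As an illustration of the technique only, the sketch below implements a multinomial Naive Bayes classifier with add-one (Laplace) smoothing in plain Python. The function names `train` and `predict`, the labels, and the toy training sentences are all assumptions made for this example; they are not the project's model or data, and in practice a library implementation such as scikit-learn's `MultinomialNB` would be a more typical choice.

```python
import math
from collections import Counter, defaultdict


def train(documents):
    """Count class frequencies and per-class word frequencies.

    documents: list of (tokens, label) pairs.
    """
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in documents:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary


def predict(tokens, class_counts, word_counts, vocabulary):
    """Return the label with the highest log posterior (add-one smoothing)."""
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in class_counts.items():
        score = math.log(doc_count / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in tokens:
            if token not in vocabulary:
                continue  # ignore words never seen in training
            count = word_counts[label][token]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy, hand-labelled examples (hypothetical, not taken from the dataset).
training = [
    (["vaccine", "safe", "effective"], "positive"),
    (["happy", "vaccine", "available"], "positive"),
    (["vaccine", "side", "effects", "scary"], "negative"),
    (["afraid", "vaccine", "dangerous"], "negative"),
]
model = train(training)
print(predict(["vaccine", "safe"], *model))       # positive
print(predict(["vaccine", "dangerous"], *model))  # negative
```

Working in log space avoids numeric underflow when many word probabilities are multiplied, and the add-one term keeps a word that never appeared in one class from forcing that class's probability to zero.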