# **Projek Akhir Praktikum Data Science**

> Analisis Sentimen Mengenai Vaksin COVID-19 Di Indonesia Menggunakan Metode Naive Bayes Classifier dan NLP Pada Sosial Media Twitter

**Oleh Kelompok 2 :**
1. Hazlan Muhammad Qodri (123190080) @hzlnqodrey
2. Elisia Dwi Rahayu (123190062) @elisiadwirahayu
3. Shania Septika Inayasari (123190055) @shaniainayasari

**Penjelasan Projek :**

Adapun pada penelitian menekankan kepada sentimen masyarakat terhadap mengenai vaksin COVID-19. Proses analisisnya akan dilakukan berdasarkan tweet yang menyertakan tagar vaksin dan pencarian di twitter dengan keyword vaksin covid 19.

## **1. Scraping Data from Twitter**

In [None]:
%pip install snscrape

In [None]:
import snscrape.modules.twitter as sntwitter
import pandas as pd

#### Query

In [None]:
# Get All Covid Sentiment Data from January 1st, 2020 until November 1st, 2022
query = "covid since:2020-01-01 until:2022-11-01 lang:id"
limit = 2000 # limit 50k rows

In [None]:
tweets = []

for tweet in sntwitter.TwitterSearchScraper(query).get_items():
    if len(tweets) == limit:
        break
    else:
        tweets.append([
            tweet.date,
            tweet.username,
            tweet.content
        ])

filename = 'tweets_covid_dataset_2k_raw_noindex.csv'
tweets_df = pd.DataFrame(tweets, columns=['Tanggal', 'Username', 'Text'])
tweets_df.to_csv(filename, index=False)
print('Scraping has completed!')

## **2. Wrangling Data** (Preprocessing)

In [None]:
%pip install tweet-preprocessor
%pip install textblob
%pip install wordcloud
%pip install nltk

In [None]:
import preprocessor as preproc
from textblob import TextBlob
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')
import csv
import string
import pandas as pd
import numpy as np
import re

In [None]:
# get data from dataset
data = pd.read_csv('https://raw.githubusercontent.com/hzlnqodrey/projek-akhir-prak-ds-sentimen-analisis-twitter-covid19-nlp-bayes/main/data_csv/tweets_covid_dataset_2k_raw_noindex.csv')

In [None]:
data.info()

In [None]:
data.sample(n=5)

##### 1. Case Folding

In [None]:
data['Text'] = data['Text'].str.lower()

In [None]:
print('Case Folding Result : \n')
data.sample(n=5)

In [None]:
data.isnull().sum()

##### 2. Cleaning

In [None]:
# cleaning overall
def preprocessing_data(x):
    return preproc.clean(x)

data['Text'] = data['Text'].apply(preprocessing_data)


In [None]:
print('Cleaning Result : \n')
data.sample(n=5)

In [None]:
# cleaning remove_comments_special
def remove_comments_special(text):
    # remove tab, new line, and back slice
    text = text.replace('\\t', " ").replace('\\n', " ").replace(
        '\\u', " ").replace('\\', " ").replace('.', " ")
    # remove non ASCII (emoticon, chinese word, .etc)
    text = text.encode('ascii', 'replace').decode('ascii')
    # remove mention, link, hashtag
    text = ' '.join(
        re.sub("([@#][A-Za-z0-9]+)|(\w+:\/\/\S+)", " ", text).split())
    # remove ascii decoded
    text = ' '.join(
        re.sub("amp; ", " ", text).split())
    text = ' '.join(
        re.sub("lt; ", " ", text).split())
    text = ' '.join(
        re.sub("gt; ", " ", text).split())
    # remove single char
    text = ' '.join(
        re.sub(r"\b[a-zA-Z]\b", " ", text).split())
    return text

data['Text'] = data['Text'].apply(remove_comments_special)

# remove symbol
def remove_symbol(text):
    text = ''.join(
        re.sub(r"[\!\@\#\$\%\^\&\*\(\)\?\,\"\|\:]+", "", text)
    )
    return text

data['Text'] = data['Text'].apply(remove_symbol)

# remove punctuation
def remove_punctuation(text):
    return text.translate(str.maketrans("", "", string.punctuation))


print('Cleaning Result : \n')
data.sample(n=5)

##### 3. Tokenizing

In [None]:
def tokenize_data(text):
    return word_tokenize(text)

data['Text_Clean'] = data['Text'].apply(tokenize_data)

In [None]:
print('Tokenizing Result : \n')
data.sample(n=5)

##### 4. Filtering

In [None]:
# Filtering | Singkatan Indo
normalizad_word = pd.read_csv(
    "https://raw.githubusercontent.com/meisaputri21/Indonesian-Twitter-Emotion-Dataset/master/kamus_singkatan.csv", sep=";", header=None)
normalizad_word_dict = {}

for index, row in normalizad_word.iterrows():
    if row[0] not in normalizad_word_dict:
        normalizad_word_dict[row[0]] = row[1]


def normalized_term(document):
    return [normalizad_word_dict[term] if term in normalizad_word_dict else term for term in document]


data['Text_Clean'] = data['Text_Clean'].apply(normalized_term)
data.head(5)

In [None]:
# Filtering | Stop Word
list_stopwords = (['yang', 'untuk', 'pada', 'ke', 'para', 'namun', 'menurut', 'antara', 'seperti', 'jika', 'jika', 'sehingga', 'mungkin', 'kembali', 'dan', 'ini', 'karena', 'oleh', 'saat', 'sekitar', 'bagi', 'serta', 'di', 'dari', 'sebagai', 'hal', 'ketika', 'adalah', 'itu', 'dalam', 'bahwa', 'atau', 'dengan', 'akan', 'juga', 'kalau', 'ada', 'terhadap', 'secara', 'agar', 'lain', 'jadi', 'yang ', 'sudah', 'sudah begitu', 'mengapa', 'kenapa', 'yaitu', 'yakni', 'daripada', 'itulah', 'lagi', 'maka', 'tentang', 'demi', 'dimana', 'kemana', 'pula', 'sambil', 'sebelum', 'sesudah', 'supaya', 'guna', 'kah', 'pun', 'sampai', 'sedangkan', 'selagi',
                  'sementara', 'tetapi', 'apakah', 'sebab', 'selain', 'seolah', 'seraya', 'seterusnya', 'dsb', 'dst', 'dll', 'dahulu', 'dulunya', 'anu', 'demikian', 'tapi', 'juga', 'mari', 'nanti', 'melainkan', 'oh', 'ok', 'sebetulnya', 'setiap', 'sesuatu', 'pasti', 'saja', 'toh', 'ya', 'walau', 'apalagi', 'bagaimanapun', 'yg', 'dg', 'rt', 'dgn', 'ny', 'd', 'klo', 'kalo', 'amp', 'biar', 'bikin', 'bilang', 'krn', 'nya', 'nih', 'sih', 'ah', 'ssh', 'om', 'ah', 'si', 'tau', 'tuh', 'utk', 'ya', 'cek', 'jd', 'aja', 't', 'nyg', 'hehe', 'pen', 'nan', 'loh', 'rt', '&amp', 'yah', 'ni', 'ret', 'za', 'nak', 'haa', 'zaa', 'maa', 'lg', 'eh', 'hmm', 'kali'])

list_stopwords = set(list_stopwords)


def stopwords_removal(words):
    return [word for word in words if word not in list_stopwords]


data['Text_Clean'] = data['Text_Clean'].apply(stopwords_removal)

In [None]:
data.head(5)

In [None]:
def sambungin_kata(text):
    text = ' '.join(text)
    return text

data['Text_Clean_Sambung'] = data['Text_Clean'].apply(sambungin_kata)

In [None]:
data

In [None]:
data.drop(['Text_Clean'], axis=1, inplace=True)

In [None]:
data

## **2.1 Menejermah Tweet Clean hasil Preproc ke Bahasa Inggris** (Translating)

In [None]:
%pip install googletrans==4.0.0rc1

In [None]:
%pip install google-cloud-translate==2.0.1
%pip install --upgrade google-cloud-translate

In [None]:
## NOTE: JIKA MEMAKAI GOOGLE TRANSLATE API WEBSITE
import pandas as pd
from googletrans import Translator

translator = Translator()

translated_word = []
new_row = []

data = data.reset_index()  # make sure indexes pair with number of rows

for index, row in data.iterrows():
    new_row.append(row['Text_Clean_Sambung'])

for per_row in new_row:
    out = translator.translate(per_row, dest='en')
    translated_word.append(out.text)

data.insert(loc=len(data.columns),
            column="text_english", value=translated_word)

print('Translating has completed!')

In [None]:
## NOTE: JIKA MEMAKAI GOOGLE CLOUD TRANSLATE API
import os

from google.cloud import translate_v2

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = r"serviceKey.json"

translate_client = translate_v2.Client()

text = "Saya siapa dan kamu dimana"

target = "en"

output = translate_client.translate(text, target_language=target)

print(output)
print(output['translatedText'])

In [None]:
# REAL IMPLEMENTATION
import os

from google.cloud import translate_v2

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = r"serviceKey.json"

translate_client = translate_v2.Client()
target = "en"

translated_word = []
new_row = []
data = data.reset_index()  # make sure indexes pair with number of rows

for index, row in data.iterrows():
    new_row.append(row['Text_Clean_Sambung'])

for per_row in new_row:
    output = translate_client.translate(per_row, target_language="en")
    translated_word.append(output['translatedText'])

data.insert(loc=len(data.columns),
            column="text_english", value=translated_word)

print('Translating has completed!')

In [None]:
data.head(5)

In [None]:
# up to csv
filename = "english_tweets_covid_dataset_2k.csv"
data.to_csv(filename, index=False)

## **2.2 Stemming**

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from textblob import TextBlob
from nltk.stem import PorterStemmer 
from nltk.tokenize import word_tokenize 
import nltk
nltk.download('punkt')
import string
import re

In [None]:
data = pd.read_csv('https://raw.githubusercontent.com/hzlnqodrey/projek-akhir-prak-ds-sentimen-analisis-twitter-covid19-nlp-bayes/main/data_csv/english_tweets_covid_dataset_2k.csv')

In [None]:
data.info()

In [None]:
data.drop(['index', 'Text', 'Text_Clean_Sambung'], axis=1, inplace=True)

In [None]:
# hapuskan symbol &#39; > ganti menjadi (')
# remove  &#39; symbol
def remove_symbol(text):
    text = ''.join(
        re.sub(r"[&#39;]+", "'", text)
    )
    return text

data['text_english'] = data['text_english'].apply(remove_symbol)

In [28]:
ps = PorterStemmer()

def stemming_data(text):
    return ps.stem(text)

data['text_english'] = data['text_english'].apply(stemming_data)

In [31]:
data.sample(n=5)

Unnamed: 0,Tanggal,Username,text_english
1012,2022-10-31 10:28:06+00:00,SidarejaPolsek,members of the sidareja police carry out the o...
1855,2022-10-31 05:06:00+00:00,naim_rajin,because the time is critical covid aid morator...
431,2022-10-31 14:48:05+00:00,mewseeshan,"two performers, gilbert nathaniel ibet, left, ..."
374,2022-10-31 15:16:40+00:00,kimchi_wz,very annoyed that the part of asking about cov...
791,2022-10-31 11:54:48+00:00,MRLofficials,already like the covid virus covering the pole...


## **3. Sentiment Analysis dengan NLP (TextBlob)**

In [32]:
data_tweet = list(data['text_english'])
polaritas = 0

tot_positif = tot_negatif = tot_netral = total = 0
status = []

for i, tweet in enumerate(data_tweet):
    analysis = TextBlob(tweet)
    polaritas += analysis.polarity

    if analysis.sentiment.polarity > 0.0:
        tot_positif += 1
        status.append('Positif')
    elif analysis.sentiment.polarity == 0.0:
        tot_netral += 1
        status.append('Netral')
    else: 
        tot_negatif += 1
        status.append('Negatif')
        
    total += 1

print(f'Hasil Analisis Data:\nPositif = {tot_positif}\nNetral = {tot_netral}\nNegatif = {tot_negatif}')
print(f'\nTotal Data : {total}')



Hasil Analisis Data:
Positif = 775
Netral = 705
Negatif = 520

Total Data : 2000


In [33]:
# tambahkan status sentiment ke dataframe
status = pd.DataFrame({'klasifikasi': status})
data['klasifikasi'] = status
data.head(5)

Unnamed: 0,Tanggal,Username,text_english,klasifikasi
0,2022-10-31 23:59:58+00:00,lordkuyang,"in the past, the western upn stall, the caulif...",Positif
1,2022-10-31 23:58:49+00:00,s_h_y_l_l_a,all illnesses are blamed for the covid lieur v...,Netral
2,2022-10-31 23:58:14+00:00,kunsh72,didn't you used to be part of the regime after...,Positif
3,2022-10-31 23:57:58+00:00,erni076,"i changed it a long time ago, when covid attac...",Negatif
4,2022-10-31 23:57:38+00:00,KENTUSIAS,then my mask got covid kek it's already endemi...,Netral


In [34]:
data.sample(n=5)

Unnamed: 0,Tanggal,Username,text_english,klasifikasi
191,2022-10-31 17:37:39+00:00,katsutoshidessu,there is one time during the covid period the ...,Netral
446,2022-10-31 14:40:51+00:00,7vwsk7wsxm,aka long covid,Negatif
623,2022-10-31 13:11:37+00:00,kominfomagetan1,update on the distribution map of covid-1' as ...,Netral
1181,2022-10-31 09:21:08+00:00,r3ypo,"two weeks ago, some of the staff of my company...",Positif
297,2022-10-31 16:07:26+00:00,njaeminverse,saw many of my moots many people have sore thr...,Positif


## **4. Klasifikasi Data dengan Naive Bayes** (Classification)