# Topic Modelling

## Crawling Data from YouTube Comments


This program retrieves the comments on a YouTube video using the YouTube Data API v3. Before running it, make sure the YouTube Data API has been enabled and an API key has been generated.

If you do not have an API key yet, follow these steps to obtain one:

1.   Log in to the Google Developer Console (https://console.developers.google.com/) with your Google account.
2.   Create a new project and fill in the requested information.
3.   Enable API services on the project page and search for YouTube Data API v3.
4.   Create credentials so the API can be used. Click the "Create Credential" button and complete the form.

You can view the API key on the Credentials tab of the dashboard.
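
Once the key exists, one option is to keep it out of the notebook itself. The following is a minimal sketch, assuming the key has been stored in an environment variable; the variable name `YOUTUBE_API_KEY` and the helper `load_api_key` are hypothetical and not part of the original notebook.

```python
import os

def load_api_key(env_var="YOUTUBE_API_KEY"):
    """Return the API key stored in the given environment variable.

    The variable name is an assumption made for this sketch.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable to your API key")
    return key
```

With this in place, `api_key = load_api_key()` could stand in for a hardcoded string.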

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
#import library
import pandas as pd
from googleapiclient.discovery import build
import numpy as np
from string import punctuation
import re
import nltk

In [3]:
# Define a function that crawls the comments of a video
def video_comments(video_id):
	# list that collects every top-level comment and reply
	replies = []

	# creating youtube resource object
	youtube = build('youtube', 'v3', developerKey=api_key)

	# retrieve youtube video results
	video_response = youtube.commentThreads().list(part='snippet,replies', videoId=video_id).execute()

	# iterate video response
	while video_response:
		
		# extracting required info
		# from each result object
		for item in video_response['items']:
			
			# Extracting comments ()
			published = item['snippet']['topLevelComment']['snippet']['publishedAt']
			user = item['snippet']['topLevelComment']['snippet']['authorDisplayName']

			# Extracting comments
			comment = item['snippet']['topLevelComment']['snippet']['textDisplay']
			likeCount = item['snippet']['topLevelComment']['snippet']['likeCount']

			replies.append([published, user, comment, likeCount])
			
			# number of replies to this comment
			replycount = item['snippet']['totalReplyCount']

			# if there are replies
			if replycount>0:
				# iterate through the replies included in this response
				# (.get guards against a thread that reports replies but omits the 'replies' key)
				for reply in item.get('replies', {}).get('comments', []):
					
					# Extract reply
					published = reply['snippet']['publishedAt']
					user = reply['snippet']['authorDisplayName']
					repl = reply['snippet']['textDisplay']
					likeCount = reply['snippet']['likeCount']
					
					# Store the reply in the list
					replies.append([published, user, repl, likeCount])

			# print comment with list of reply
			#print(comment, replies, end = '\n\n')

			# empty reply list
			#replies = []

		# Again repeat
		if 'nextPageToken' in video_response:
			video_response = youtube.commentThreads().list(
					part = 'snippet,replies',
					pageToken = video_response['nextPageToken'], 
					videoId = video_id
				).execute()
		else:
			break
	#endwhile
	return replies
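
The nested structure that the function walks through can be tried out without calling the API. The sketch below flattens a single `commentThreads` item into rows. The helper `rows_from_thread` and the contents of `mock_item` are made up for illustration; only the field names follow the keys used in the function above.

```python
def rows_from_thread(item):
    """Flatten one commentThreads item into [publishedAt, author, text, likeCount] rows."""
    def row(snippet):
        return [snippet['publishedAt'], snippet['authorDisplayName'],
                snippet['textDisplay'], snippet['likeCount']]

    # the top-level comment always comes first
    rows = [row(item['snippet']['topLevelComment']['snippet'])]
    # fall back to an empty list when the item carries no 'replies' key
    for reply in item.get('replies', {}).get('comments', []):
        rows.append(row(reply['snippet']))
    return rows

# Made-up item that mimics the shape of the API response (not real data)
mock_item = {
    'snippet': {
        'totalReplyCount': 1,
        'topLevelComment': {'snippet': {
            'publishedAt': '2023-01-01T00:00:00Z', 'authorDisplayName': 'user_a',
            'textDisplay': 'first comment', 'likeCount': 2}},
    },
    'replies': {'comments': [{'snippet': {
        'publishedAt': '2023-01-01T01:00:00Z', 'authorDisplayName': 'user_b',
        'textDisplay': 'a reply', 'likeCount': 0}}]},
}
print(rows_from_thread(mock_item))
```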


In [4]:
# fill in your own API key
api_key = 'YOUR_API_KEY'

# video URL = https://www.youtube.com/watch?v=LOvbNCf44TE
video_id = "LOvbNCf44TE" # fill in the video code / ID

# Call function
comments = video_comments(video_id)

comments

[['2023-05-10T01:51:52Z',
  'chunchun lo',
  'walaupun surve menunjuk kan terendah pilihan ku ttep pak anis❤',
  0],
 ['2023-05-09T20:17:02Z',
  'Agung Suprianto',
  'Hidup pak ganjar, dari maluku Utara buat pak ganjar',
  0],
 ['2023-05-09T18:04:01Z', 'Indra Mohd92', '200ribu per kepala. Sadis ！', 0],
 ['2023-05-09T15:59:58Z', 'Wawansyah Wawa', 'Anis no 1  tetap', 0],
 ['2023-05-09T15:48:48Z',
  'Jeffrey Rinaldo',
  'Partai vs rakyat ???<br>Partainya hebat klo rkytnya gak milih mau apa loe??',
  0],
 ['2023-05-09T15:46:04Z',
  'boyak Tary',
  'Pak parabowo persiden 2024 2029 🤲amin amin amin',
  1],
 ['2023-05-09T15:33:35Z',
  'Fakhri Munan',
  'Anis mending ke laut aja...<br>Karena aku mengidolakan pak Ganjar sama pak Prabowo untuk jadi presiden.',
  0],
 ['2023-05-09T15:30:22Z',
  'Antonio Tohiri',
  'Bismillahirrahmanirrahim<br>   Mau ngakak hasil nya <br>Siapapun yang membohongin rakyat memakai cara curang semogga kelak orang orang bekerjasama dalam kecurangan dan kebohongan akan b

In [5]:
# convert the list into a DataFrame
df = pd.DataFrame(comments, columns=['publishedAt', 'authorDisplayName', 'text', 'likeCount'])
df

Unnamed: 0,publishedAt,authorDisplayName,text,likeCount
0,2023-05-10T01:51:52Z,chunchun lo,walaupun surve menunjuk kan terendah pilihan k...,0
1,2023-05-09T20:17:02Z,Agung Suprianto,"Hidup pak ganjar, dari maluku Utara buat pak g...",0
2,2023-05-09T18:04:01Z,Indra Mohd92,200ribu per kepala. Sadis ！,0
3,2023-05-09T15:59:58Z,Wawansyah Wawa,Anis no 1 tetap,0
4,2023-05-09T15:48:48Z,Jeffrey Rinaldo,Partai vs rakyat ???<br>Partainya hebat klo rk...,0
...,...,...,...,...
501,2023-05-08T23:39:09Z,Hasbi Abduh,Betul bro. Jadikan pengalaman menjadi pembelaj...,0
502,2023-05-08T23:34:57Z,+62 netral,Yang merasa sangat nyaman dengan pimpinan PDIp...,0
503,2023-05-08T17:32:05Z,metropolitan,Anis Baswedan aja,0
504,2023-05-08T16:20:17Z,antonbudiarto7,Prabowo kawan❤,0


In [7]:
cd /content/drive/MyDrive/prosaindata

/content/drive/MyDrive/prosaindata


In [8]:
df.to_csv('capres2024.csv', index=False)

## Preprocessing

### 1. Symbol & Punctuation Removal, Case Folding

At this stage, preprocessing removes symbols and punctuation and applies case folding, that is, converting every letter in the data to lowercase.
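
The comment text returned by the API contains HTML such as `<br>`, which survives the steps in this section as the token `br`. The following is a minimal sketch of a single cleaning function that drops tags and URLs before removing symbols and digits. The helper name `clean_comment` is hypothetical, and dropping tags first is an assumption about the desired result, not the notebook's method.

```python
import html
import re

def clean_comment(text):
    """Hypothetical helper: strip HTML, URLs, symbols and digits, then lowercase."""
    text = html.unescape(text)                 # decode entities such as &amp;
    text = re.sub(r"<[^>]+>", " ", text)       # drop tags such as <br>
    text = re.sub(r"\w+://\S+", " ", text)     # drop URLs before punctuation removal
    text = re.sub(r"[^A-Za-z\s]", " ", text)   # keep ASCII letters and whitespace only
    return re.sub(r"\s+", " ", text).strip().lower()

print(clean_comment("Partai vs rakyat ???<br>Partainya hebat 2024"))
```

Running the URL pattern before the character filter matters: once `:` and `/` have been replaced by spaces, a URL can no longer be recognized.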

In [9]:
# remove escape sequences, non-ASCII symbols and emoji, and URL schemes
def remove_text_special (text):
  text = text.replace('\\t',"").replace('\\n',"").replace('\\u',"").replace('\\',"")
  text = text.encode('ascii', 'replace').decode('ascii')
  return text.replace("http://"," ").replace("https://", " ")
df['text'] = df['text'].apply(remove_text_special)
print(df['text'])

0      walaupun surve menunjuk kan terendah pilihan k...
1      Hidup pak ganjar, dari maluku Utara buat pak g...
2                            200ribu per kepala. Sadis ?
3                                       Anis no 1  tetap
4      Partai vs rakyat ???<br>Partainya hebat klo rk...
                             ...                        
501    Betul bro. Jadikan pengalaman menjadi pembelaj...
502    Yang merasa sangat nyaman dengan pimpinan PDIp...
503                                    Anis Baswedan aja
504                                       Prabowo kawan?
505                         Mantap, kulo dere anis mawon
Name: text, Length: 506, dtype: object


In [10]:
# remove punctuation
def remove_tanda_baca(text):
  text = re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",text)
  return text

df['text'] = df['text'].apply(remove_tanda_baca)
df['text'].head(20)

0     walaupun surve menunjuk kan terendah pilihan k...
1     Hidup pak ganjar  dari maluku Utara buat pak g...
2                           200ribu per kepala  Sadis  
3                                      Anis no 1  tetap
4     Partai vs rakyat     br Partainya hebat klo rk...
5       Pak parabowo persiden 2024 2029  amin amin amin
6     Anis mending ke laut aja    br Karena aku meng...
7     Bismillahirrahmanirrahim br    Mau ngakak hasi...
8          Yu kita sama sama dukung p Anies Baswedan   
9                        Heran survei apa ini ya wkwkkw
10     Survy mah ngga aneh  sekedar menggiring opini   
11                                       Prabowo menang
12    PRABOWO urutan 1   GANJAR PRANOWO urutan ke 2 ...
13                     Anis   AHY yesssas  Yang lain No
14                                      Duit duit duit 
15            Survei itu tergantung siapa yg memesannya
16    apapun jenis survey kagak percaya sama sekali ...
17    Katakan TIDAK Pilih GANJAR Karena didukung

In [11]:
# remove digits
def remove_numbers (text):
  return re.sub(r"\d+", "", text)
df['text'] = df['text'].apply(remove_numbers)
df['text']

0      walaupun surve menunjuk kan terendah pilihan k...
1      Hidup pak ganjar  dari maluku Utara buat pak g...
2                               ribu per kepala  Sadis  
3                                        Anis no   tetap
4      Partai vs rakyat     br Partainya hebat klo rk...
                             ...                        
501    Betul bro  Jadikan pengalaman menjadi pembelaj...
502    Yang merasa sangat nyaman dengan pimpinan PDIp...
503                                    Anis Baswedan aja
504                                       Prabowo kawan 
505                         Mantap  kulo dere anis mawon
Name: text, Length: 506, dtype: object

In [12]:
# case folding
def casefolding(Comment):
  Comment = Comment.lower()
  return Comment
df['text'] = df['text'].apply(casefolding)
df['text']

0      walaupun surve menunjuk kan terendah pilihan k...
1      hidup pak ganjar  dari maluku utara buat pak g...
2                               ribu per kepala  sadis  
3                                        anis no   tetap
4      partai vs rakyat     br partainya hebat klo rk...
                             ...                        
501    betul bro  jadikan pengalaman menjadi pembelaj...
502    yang merasa sangat nyaman dengan pimpinan pdip...
503                                    anis baswedan aja
504                                       prabowo kawan 
505                         mantap  kulo dere anis mawon
Name: text, Length: 506, dtype: object

### 2. Tokenizing
At this stage the text is tokenized. Tokenizing splits a sentence into individual words so that the text can be analyzed further.
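
To make the idea concrete, here is a dependency-free sketch that splits already-cleaned text on whitespace. It is a stand-in for illustration, not a drop-in replacement for `nltk.word_tokenize`, which applies additional rules. The function name `simple_tokenize` is hypothetical.

```python
def simple_tokenize(sentence):
    """Split cleaned, lowercased text into word tokens on whitespace."""
    # str.split() without arguments collapses repeated spaces and trims the ends
    return sentence.split()

print(simple_tokenize("anis no   tetap"))
print(simple_tokenize("ribu per kepala  sadis  "))
```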

In [13]:
# tokenization
# from nltk.tokenize import TweetTokenizer
nltk.download('punkt')
# def word_tokenize(text):
#   tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
#   return tokenizer.tokenize(text)

df['review_token'] = df['text'].apply(lambda sentence: nltk.word_tokenize(sentence))
df['review_token']

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


0      [walaupun, surve, menunjuk, kan, terendah, pil...
1      [hidup, pak, ganjar, dari, maluku, utara, buat...
2                             [ribu, per, kepala, sadis]
3                                      [anis, no, tetap]
4      [partai, vs, rakyat, br, partainya, hebat, klo...
                             ...                        
501    [betul, bro, jadikan, pengalaman, menjadi, pem...
502    [yang, merasa, sangat, nyaman, dengan, pimpina...
503                                [anis, baswedan, aja]
504                                     [prabowo, kawan]
505                    [mantap, kulo, dere, anis, mawon]
Name: review_token, Length: 506, dtype: object

In [14]:
cd /content/drive/MyDrive/prosaindata

/content/drive/MyDrive/prosaindata


In [15]:
df['review_token'].to_csv('normalisasidata.csv', index=False)

### 3. Word Normalization
At this stage the data is normalized, replacing nonstandard (informal) words with their standard forms.
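
A minimal sketch of dictionary-based normalization follows. The mapping entries are assumptions chosen for illustration; in practice they would come from a curated slang-to-standard word list.

```python
# Illustrative slang-to-standard mapping; these entries are assumptions
normalize_word_dict = {"yg": "yang", "klo": "kalau", "gak": "tidak", "ttep": "tetap"}

def normalized_term(tokens):
    """Replace each token that has a dictionary entry; leave the rest unchanged."""
    return [normalize_word_dict.get(term, term) for term in tokens]

print(normalized_term(["survei", "itu", "tergantung", "siapa", "yg", "memesannya"]))
```

The lookup key is the token itself, so the dictionary has to be keyed by the nonstandard spelling for a replacement to happen.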

In [16]:
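# Normalize nonstandard words
# NOTE: this reads back the token file saved in the previous cell, not a slang-to-standard dictionary
normalize = pd.read_csv("/content/drive/MyDrive/prosaindata/normalisasidata.csv")

normalize_word_dict = {}

# iterrows() yields (index, row) pairs, so the keys here are integer row indices
# and no token ever matches; the output below is therefore identical to review_token
for row in normalize.iterrows():
  if row[0] not in normalize_word_dict:
    normalize_word_dict[row[0]] = row[1]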

def normalized_term(comment):
  return [normalize_word_dict[term] if term in normalize_word_dict else term for term in comment]

df['comment_normalize'] = df['review_token'].apply(normalized_term)
df['comment_normalize'].head(20)

0     [walaupun, surve, menunjuk, kan, terendah, pil...
1     [hidup, pak, ganjar, dari, maluku, utara, buat...
2                            [ribu, per, kepala, sadis]
3                                     [anis, no, tetap]
4     [partai, vs, rakyat, br, partainya, hebat, klo...
5           [pak, parabowo, persiden, amin, amin, amin]
6     [anis, mending, ke, laut, aja, br, karena, aku...
7     [bismillahirrahmanirrahim, br, mau, ngakak, ha...
8     [yu, kita, sama, sama, dukung, p, anies, baswe...
9                 [heran, survei, apa, ini, ya, wkwkkw]
10    [survy, mah, ngga, aneh, sekedar, menggiring, ...
11                                    [prabowo, menang]
12    [prabowo, urutan, ganjar, pranowo, urutan, ke,...
13                 [anis, ahy, yesssas, yang, lain, no]
14                                   [duit, duit, duit]
15     [survei, itu, tergantung, siapa, yg, memesannya]
16    [apapun, jenis, survey, kagak, percaya, sama, ...
17    [katakan, tidak, pilih, ganjar, karena, di

### 4. Stopwords Removal
At this stage, words that carry little meaning are removed. Stopword removal is performed twice: first using the Indonesian stopword corpus from the Python library NLTK, then using the file 'list_stopwords'.
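
The second pass can be sketched as a membership test against a plain Python set. The words in `custom_stopwords` are hypothetical examples, not the contents of 'list_stopwords'.

```python
def remove_stopwords(tokens, stopword_set):
    """Keep only the tokens that are not in the stopword set."""
    return [word for word in tokens if word not in stopword_set]

# Hypothetical extra stopwords, for illustration only
custom_stopwords = {"br", "yg", "nya", "ya"}

print(remove_stopwords(["heran", "survei", "ya", "wkwkkw"], custom_stopwords))
```

A set gives constant-time lookups and compares against the words themselves, which is what the filter needs.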

In [17]:
#Stopword Removal
nltk.download('stopwords')
from nltk.corpus import stopwords
txt_stopwords = stopwords.words('indonesian')

def stopwords_removal(filtering) :
  filtering = [word for word in filtering if word not in txt_stopwords]
  return filtering

df['stopwords_removal'] = df['comment_normalize'].apply(stopwords_removal)
df['stopwords_removal'].head(20)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


0            [surve, terendah, pilihan, ku, ttep, anis]
1                [hidup, ganjar, maluku, utara, ganjar]
2                                 [ribu, kepala, sadis]
3                                            [anis, no]
4     [partai, vs, rakyat, br, partainya, hebat, klo...
5                [parabowo, persiden, amin, amin, amin]
6     [anis, mending, laut, aja, br, mengidolakan, g...
7     [bismillahirrahmanirrahim, br, ngakak, hasil, ...
8                      [yu, dukung, p, anies, baswedan]
9                           [heran, survei, ya, wkwkkw]
10    [survy, mah, ngga, aneh, sekedar, menggiring, ...
11                                    [prabowo, menang]
12    [prabowo, urutan, ganjar, pranowo, urutan, puj...
13                             [anis, ahy, yesssas, no]
14                                   [duit, duit, duit]
15                 [survei, tergantung, yg, memesannya]
16    [apapun, jenis, survey, kagak, percaya, br, an...
17                    [pilih, ganjar, didukung, 

In [19]:
cd /content/drive/MyDrive/prosaindata

/content/drive/MyDrive/prosaindata


In [20]:
df['stopwords_removal'].to_csv('stopwords1.csv', index=False)

In [21]:
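# stopword removal 2
# NOTE: this reads back the tokens saved in the previous cell rather than a separate stopword list
data_stopwords = pd.read_csv("/content/drive/MyDrive/prosaindata/stopwords1.csv")
print(data_stopwords)

# `word not in data_stopwords` tests against the DataFrame's column labels,
# so this pass leaves the tokens unchanged, as the output below shows
def stopwords_removal2(tokens) :
  tokens = [word for word in tokens if word not in data_stopwords]
  return tokens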

df['stopwords_removal_final'] = df['stopwords_removal'].apply(stopwords_removal2)
df['stopwords_removal_final'].head(20)

                                     stopwords_removal
0    ['surve', 'terendah', 'pilihan', 'ku', 'ttep',...
1     ['hidup', 'ganjar', 'maluku', 'utara', 'ganjar']
2                          ['ribu', 'kepala', 'sadis']
3                                       ['anis', 'no']
4    ['partai', 'vs', 'rakyat', 'br', 'partainya', ...
..                                                 ...
501  ['bro', 'jadikan', 'pengalaman', 'pembelajaran...
502  ['nyaman', 'pimpinan', 'pdip', 'silahkan', 'pi...
503                        ['anis', 'baswedan', 'aja']
504                               ['prabowo', 'kawan']
505        ['mantap', 'kulo', 'dere', 'anis', 'mawon']

[506 rows x 1 columns]


0            [surve, terendah, pilihan, ku, ttep, anis]
1                [hidup, ganjar, maluku, utara, ganjar]
2                                 [ribu, kepala, sadis]
3                                            [anis, no]
4     [partai, vs, rakyat, br, partainya, hebat, klo...
5                [parabowo, persiden, amin, amin, amin]
6     [anis, mending, laut, aja, br, mengidolakan, g...
7     [bismillahirrahmanirrahim, br, ngakak, hasil, ...
8                      [yu, dukung, p, anies, baswedan]
9                           [heran, survei, ya, wkwkkw]
10    [survy, mah, ngga, aneh, sekedar, menggiring, ...
11                                    [prabowo, menang]
12    [prabowo, urutan, ganjar, pranowo, urutan, puj...
13                             [anis, ahy, yesssas, no]
14                                   [duit, duit, duit]
15                 [survei, tergantung, yg, memesannya]
16    [apapun, jenis, survey, kagak, percaya, br, an...
17                    [pilih, ganjar, didukung, 

### 5. Stemming
At this stage the tokens are stemmed. Stemming maps each word and reduces it to its base (root) form.
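
Stemming is comparatively slow, so it helps to stem each distinct term only once and reuse the result. The sketch below shows that caching pattern with a toy stand-in stemmer. `toy_stem` is hypothetical and only strips the suffix "-nya"; in the notebook, the Sastrawi stemmer's `stem` method would take its place.

```python
from functools import lru_cache

def toy_stem(term):
    """Stand-in stemmer for illustration only: strips the suffix '-nya'."""
    return term[:-3] if term.endswith("nya") and len(term) > 5 else term

@lru_cache(maxsize=None)
def cached_stem(term):
    # each distinct term reaches the stemmer once; repeats are served from the cache
    return toy_stem(term)

def stem_document(tokens):
    return [cached_stem(term) for term in tokens]

print(stem_document(["partai", "partainya", "hebat", "partainya"]))
```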

In [27]:
pip install sastrawi

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [28]:
pip install swifter

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting swifter
  Downloading swifter-1.3.4.tar.gz (830 kB)
  Preparing metadata (setup.py) ... done
Collecting jedi>=0.16 (from ipython>=4.0.0->ipywidgets>=7.0.0->swifter)
  Downloading jedi-0.18.2-py2.py3-none-any.whl (1.6 MB)
Building wheels for collected packages: swifter
  Building wheel for swifter (setup.py) ... done
  Created wheel for swifter: filename=swifter-1.3.4-py3-none-any.whl size=16299 sha256=7578aecae8f998f74ba3cabfa38c84b0581f8bce49450f7bb92d5881e00d041f
  Stored in directory: /root/.cache/pip/wheels/6c/bd/3e/2d6afc9bc36c9975f8e4215a270bbac6580c4361ebd6bb2323
Successfully built swifter
Installing colle

In [29]:
# stemming
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
import string
import swifter
factory = StemmerFactory()
stemmer = factory.create_stemmer()

def stemming (term):
  return stemmer.stem(term)

term_dict = {}
for document in df['stopwords_removal_final']:
  for term in document:
    if term not in term_dict:
      term_dict[term] = ''


In [30]:
print(len(term_dict))
print("-----------------------------")

1690
-----------------------------


In [31]:
for term in term_dict:
  term_dict[term] = stemming(term)
  print(term,":",term_dict[term])

print(term_dict)
print("-----------------------------")

surve : surve
terendah : rendah
pilihan : pilih
ku : ku
ttep : ttep
anis : anis
hidup : hidup
ganjar : ganjar
maluku : malu
utara : utara
ribu : ribu
kepala : kepala
sadis : sadis
no : no
partai : partai
vs : vs
rakyat : rakyat
br : br
partainya : partai
hebat : hebat
klo : klo
rkytnya : rkytnya
gak : gak
milih : milih
loe : loe
parabowo : parabowo
persiden : persiden
amin : amin
mending : mending
laut : laut
aja : aja
mengidolakan : idola
prabowo : prabowo
presiden : presiden
bismillahirrahmanirrahim : bismillahirrahmanirrahim
ngakak : ngakak
hasil : hasil
nya : nya
membohongin : membohongin
memakai : pakai
curang : curang
semogga : semogga
kelak : kelak
orang : orang
bekerjasama : bekerjasama
kecurangan : curang
kebohongan : bohong
bertanggung : tanggung
hadapan : hadap
allah : allah
swt : swt
akhirat : akhirat
aamiin : aamiin
allahumma : allahumma
yu : yu
dukung : dukung
p : p
anies : anies
baswedan : baswedan
heran : heran
survei : survei
ya : ya
wkwkkw : wkwkkw
survy : survy
mah :

In [32]:
def get_stemming(document):
  return [term_dict[term] for term in document]

In [33]:
df['stemming'] = df['stopwords_removal_final'].swifter.apply(get_stemming)

Pandas Apply:   0%|          | 0/506 [00:00<?, ?it/s]

In [34]:
print(df['stemming'])

0                 [surve, rendah, pilih, ku, ttep, anis]
1                   [hidup, ganjar, malu, utara, ganjar]
2                                  [ribu, kepala, sadis]
3                                             [anis, no]
4      [partai, vs, rakyat, br, partai, hebat, klo, r...
                             ...                        
501                 [bro, jadi, alam, ajar, bodoh, kali]
502    [nyaman, pimpin, pdip, silah, pilih, tugas, ub...
503                                [anis, baswedan, aja]
504                                     [prabowo, kawan]
505                    [mantap, kulo, dere, anis, mawon]
Name: stemming, Length: 506, dtype: object


In [35]:
df.head(20)

Unnamed: 0,publishedAt,authorDisplayName,text,likeCount,review_token,comment_normalize,stopwords_removal,stopwords_removal_final,stemming
0,2023-05-10T01:51:52Z,chunchun lo,walaupun surve menunjuk kan terendah pilihan k...,0,"[walaupun, surve, menunjuk, kan, terendah, pil...","[walaupun, surve, menunjuk, kan, terendah, pil...","[surve, terendah, pilihan, ku, ttep, anis]","[surve, terendah, pilihan, ku, ttep, anis]","[surve, rendah, pilih, ku, ttep, anis]"
1,2023-05-09T20:17:02Z,Agung Suprianto,hidup pak ganjar dari maluku utara buat pak g...,0,"[hidup, pak, ganjar, dari, maluku, utara, buat...","[hidup, pak, ganjar, dari, maluku, utara, buat...","[hidup, ganjar, maluku, utara, ganjar]","[hidup, ganjar, maluku, utara, ganjar]","[hidup, ganjar, malu, utara, ganjar]"
2,2023-05-09T18:04:01Z,Indra Mohd92,ribu per kepala sadis,0,"[ribu, per, kepala, sadis]","[ribu, per, kepala, sadis]","[ribu, kepala, sadis]","[ribu, kepala, sadis]","[ribu, kepala, sadis]"
3,2023-05-09T15:59:58Z,Wawansyah Wawa,anis no tetap,0,"[anis, no, tetap]","[anis, no, tetap]","[anis, no]","[anis, no]","[anis, no]"
4,2023-05-09T15:48:48Z,Jeffrey Rinaldo,partai vs rakyat br partainya hebat klo rk...,0,"[partai, vs, rakyat, br, partainya, hebat, klo...","[partai, vs, rakyat, br, partainya, hebat, klo...","[partai, vs, rakyat, br, partainya, hebat, klo...","[partai, vs, rakyat, br, partainya, hebat, klo...","[partai, vs, rakyat, br, partai, hebat, klo, r..."
5,2023-05-09T15:46:04Z,boyak Tary,pak parabowo persiden amin amin amin,1,"[pak, parabowo, persiden, amin, amin, amin]","[pak, parabowo, persiden, amin, amin, amin]","[parabowo, persiden, amin, amin, amin]","[parabowo, persiden, amin, amin, amin]","[parabowo, persiden, amin, amin, amin]"
6,2023-05-09T15:33:35Z,Fakhri Munan,anis mending ke laut aja br karena aku meng...,0,"[anis, mending, ke, laut, aja, br, karena, aku...","[anis, mending, ke, laut, aja, br, karena, aku...","[anis, mending, laut, aja, br, mengidolakan, g...","[anis, mending, laut, aja, br, mengidolakan, g...","[anis, mending, laut, aja, br, idola, ganjar, ..."
7,2023-05-09T15:30:22Z,Antonio Tohiri,bismillahirrahmanirrahim br mau ngakak hasi...,0,"[bismillahirrahmanirrahim, br, mau, ngakak, ha...","[bismillahirrahmanirrahim, br, mau, ngakak, ha...","[bismillahirrahmanirrahim, br, ngakak, hasil, ...","[bismillahirrahmanirrahim, br, ngakak, hasil, ...","[bismillahirrahmanirrahim, br, ngakak, hasil, ..."
8,2023-05-09T14:30:16Z,Taufiq P,yu kita sama sama dukung p anies baswedan,0,"[yu, kita, sama, sama, dukung, p, anies, baswe...","[yu, kita, sama, sama, dukung, p, anies, baswe...","[yu, dukung, p, anies, baswedan]","[yu, dukung, p, anies, baswedan]","[yu, dukung, p, anies, baswedan]"
9,2023-05-09T13:54:07Z,cindua mato 1111,heran survei apa ini ya wkwkkw,0,"[heran, survei, apa, ini, ya, wkwkkw]","[heran, survei, apa, ini, ya, wkwkkw]","[heran, survei, ya, wkwkkw]","[heran, survei, ya, wkwkkw]","[heran, survei, ya, wkwkkw]"


## Feature Extraction (TF-IDF)

In [38]:
def joinkata(data):
  kalimat = ""
  for i in data:
    kalimat += i
    kalimat += " "
  return kalimat

text = df['stemming'].swifter.apply(joinkata)
text

Pandas Apply:   0%|          | 0/506 [00:00<?, ?it/s]

0                       surve rendah pilih ku ttep anis 
1                        hidup ganjar malu utara ganjar 
2                                     ribu kepala sadis 
3                                               anis no 
4      partai vs rakyat br partai hebat klo rkytnya g...
                             ...                        
501                       bro jadi alam ajar bodoh kali 
502    nyaman pimpin pdip silah pilih tugas ubah sila...
503                                   anis baswedan aja 
504                                       prabowo kawan 
505                         mantap kulo dere anis mawon 
Name: stemming, Length: 506, dtype: object

In [39]:
from sklearn.feature_extraction.text import TfidfVectorizer

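# Vectorize documents using TF-IDF
# NOTE: stop_words='english' applies scikit-learn's built-in English list (it drops e.g. 'no');
# Indonesian stopwords were already removed during preprocessing
tfidf = TfidfVectorizer(lowercase=True,
                        stop_words='english',
                        ngram_range = (1,1)
                        )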

# Fit and Transform the documents
X = tfidf.fit_transform(text)
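
The weights that `TfidfVectorizer` assigns can be reproduced by hand on a tiny corpus. The sketch below uses scikit-learn's default formula: raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each document row. The three-document corpus and the helper `tfidf_rows` are made up for illustration, and tokenization is assumed to be done already.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """TF-IDF with smoothed idf and L2-normalized rows, one dict per document."""
    n = len(docs)
    # document frequency: in how many documents each term occurs
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for doc in docs:
        weights = {t: c * idf[t] for t, c in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows

# Toy corpus, for illustration only
docs = [["anis"], ["prabowo", "menang"], ["anis", "baswedan", "aja"]]
for row in tfidf_rows(docs):
    print(row)
```

A document with a single remaining term always gets weight 1.0 after normalization, and a term that occurs in more documents receives a lower weight than a rarer term in the same document.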

In [40]:
print(X)

  (0, 78)	0.23777674421097308
  (0, 1350)	0.5231948091766815
  (0, 665)	0.4986407710106773
  (0, 982)	0.2797190151979825
  (0, 1076)	0.4294268248330638
  (0, 1250)	0.39804103770760446
  (1, 1392)	0.574000054449248
  (1, 733)	0.45224202225503946
  (1, 380)	0.5113494344434195
  (1, 458)	0.45224202225503946
  (2, 1099)	0.627656690290573
  (2, 606)	0.5610867980833005
  (2, 1083)	0.5396560795081563
  (3, 78)	1.0
  (4, 703)	0.32821719549666706
  (4, 787)	0.31568093484637116
  (4, 375)	0.2299467091653322
  (4, 1089)	0.3671583778581572
  (4, 630)	0.26649688171996105
  (4, 453)	0.3443792464429413
  (4, 203)	0.17158143696453781
  (4, 1057)	0.18861451699698084
  (4, 1401)	0.3443792464429413
  (4, 928)	0.48136048052049873
  (5, 61)	0.8704962999677884
  :	:
  (501, 24)	0.463985278425688
  (501, 39)	0.3655637991413358
  (501, 209)	0.3750440059153058
  (501, 185)	0.4147745387835119
  (501, 515)	0.4351988411022629
  (501, 571)	0.38598810146008683
  (502, 1175)	0.6816052029596734
  (502, 882)	0.3408026

In [41]:
df_tfidf = pd.DataFrame(
    X.toarray().T, columns=[f'D{i+1}' for i in range(len(text))], index=tfidf.get_feature_names_out()
)
df_tfidf

Unnamed: 0,D1,D2,D3,D4,D5,D6,D7,D8,D9,D10,...,D497,D498,D499,D500,D501,D502,D503,D504,D505,D506
aamiin,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.461088,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
abadi,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
abal,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
abas,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
abri,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
yuk,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
yutubes,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
zolimi,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
zonk,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Latent Semantic Analysis (LSA)

In [42]:
from sklearn.decomposition import TruncatedSVD

In [43]:
# Define the number of topics or components
num_components=10

# Create SVD object
lsa = TruncatedSVD(n_components=num_components, n_iter=100, random_state=42)

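# Fit SVD model on data; rows of doc_topic are the documents in topic space
doc_topic = lsa.fit_transform(X)

# Get singular values and components
Sigma = lsa.singular_values_
# components_ already has shape (n_components, n_terms), i.e. V^T
V_transpose = lsa.components_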
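
For reference, the decomposition that `TruncatedSVD` approximates can be written as follows, where $X$ is the document-term TF-IDF matrix with $m$ documents and $n$ terms, and $k$ is `num_components`. The symbols are introduced here for illustration and do not appear in the original notebook.

```latex
X \approx U_k \, \Sigma_k \, V_k^{\top},
\qquad
X \in \mathbb{R}^{m \times n},\quad
U_k \in \mathbb{R}^{m \times k},\quad
\Sigma_k = \operatorname{diag}(\sigma_1, \dots, \sigma_k),\quad
V_k^{\top} \in \mathbb{R}^{k \times n}
```

In scikit-learn terms, `fit_transform` returns $U_k \Sigma_k$ (documents in topic space), `singular_values_` holds $\sigma_1, \dots, \sigma_k$, and `components_` is $V_k^{\top}$, with one row of term weights per topic.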

In [44]:
# Print the topics with their terms
terms = tfidf.get_feature_names_out()

for index, component in enumerate(lsa.components_):
    # pair each term with its weight in this component and keep the five largest
    top_terms = sorted(zip(terms, component), key=lambda t: t[1], reverse=True)[:5]
    top_terms_list = [term for term, _ in top_terms]
    print("Topic "+str(index)+": ",top_terms_list)

Topic 0:  ['prabowo', 'presiden', 'br', 'anis', 'ganjar']
Topic 1:  ['abal', 'survey', 'anies', 'anis', 'survei']
Topic 2:  ['abal', 'prabowo', 'survey', 'surve', 'percaya']
Topic 3:  ['anies', 'baswedan', 'dukung', 'survey', 'best']
Topic 4:  ['anis', 'baswedan', 'anies', 'prabowo', 'urut']
Topic 5:  ['percaya', 'survey', 'survei', 'bayar', 'yg']
Topic 6:  ['survei', 'menang', 'yg', 'bayar', 'pilih']
Topic 7:  ['ganjar', 'survey', 'percaya', 'presiden', 'yg']
Topic 8:  ['presiden', 'insyaallah', 'yg', 'indonesia', 'moga']
Topic 9:  ['yg', 'menang', 'rakyat', 'pilih', 'partai']
