# **Crawling Data**

Data crawling is the process of collecting data from a source on the internet. Here I crawled tweets from Twitter using twint. The crawl returned 180 tweets, which I filtered down to 101 because some of them were not relevant. The tweets were posted between 19-09-2022 and 27-09-2022.



In [None]:
%%capture
!git clone --depth=1 https://github.com/twintproject/twint.git
%cd twint
!pip3 install . -r requirements.txt

In [None]:
%%capture
!pip install nest-asyncio #install library nest-asyncio

In [None]:
%%capture
!pip install aiohttp==3.7.0 #install aiohttp

In [None]:
import nest_asyncio  # import nest_asyncio
nest_asyncio.apply()  # call once to allow nested event loops inside a Jupyter notebook
import twint  # import twint

## Twitter Crawling Results

In [None]:
c = twint.Config()  # create the configuration object
c.Search = '#prabowo'  # search keyword
c.Pandas = True  # keep the results in a pandas DataFrame
c.Limit = 500  # limit the crawl to 500 tweets
twint.run.Search(c)  # run the search

1579369175763791872 2022-10-10 07:12:09 +0000 <pasti_2024> Langkah nyata dan kepedulian besar Pak Erick Thohir terhadap sepak bola Indonesia memang harus kita apresiasi bersama.  Komitmen beliau sangat besar dalam menjaga keberlangsungan hidup masyarakat. Seperti yang kali ini ditunjukkan langsung Pak @erickthohir.  #prabowo #erickthohir  https://t.co/9wYbD3G4wX
1579345378239209473 2022-10-10 05:37:35 +0000 <polsight> "NasDem mempertanyakan sikap Sekjen PDIP Hasto Kristiyanto yang tampak keras mengkritisi deklarasi Anies Baswedan sebagai calon presiden, tapi diam saja saat Prabowo Subianto." #aniesbaswedan #prabowo #pdiperjuangan #nasdem #capres2024 #Pemilu2024 #politik #beritaterkini #berita  https://t.co/weK0bPy94c
1579333327890579456 2022-10-10 04:49:42 +0000 <IdSinpo> Prabowo Ajak Kadesnya Berakhir Pekan ke Desa Butuh untuk Kembangkan Bojong Koneng  #Prabowo #Kades #BojongKoneng  https://t.co/u7ZEPGKMg5
1579321392331657217 2022-10-10 04:02:16 +0000 <papandu08> Pak Bowo lagi beli bu

In [None]:
Tweets_df = twint.storage.panda.Tweets_df
Tweets_df["tweet"].to_csv("prabowo.csv")  # save the tweet text to prabowo.csv

# **Preprocessing**

Preprocessing is the stage where the data is "cleaned": the raw tweets are tidied up so that they are suitable for further processing.
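As a rough preview of the cleaning steps this section applies one at a time, the sketch below chains them into a single function. The function name `clean_tweet` and the sample string are made up for illustration, and a plain whitespace split stands in for NLTK's `word_tokenize`, so treat it as an approximation of the notebook's pipeline rather than a replacement for it.

```python
import re
import string

def clean_tweet(text):
    """Apply the cleaning steps in order and return the cleaned string."""
    text = text.lower()                                              # lowercase
    text = re.sub(r"([@#][A-Za-z0-9]+)|(\w+://\S+)", " ", text)      # mentions, hashtags, links
    text = re.sub(r"\d+", "", text)                                  # numbers
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation
    text = re.sub(r"\s+", " ", text).strip()                         # extra whitespace
    return text

# Made-up sample tweet, only for illustration
sample = "Prabowo Ajak Kades ke Desa 2024!  #Prabowo https://t.co/abc"
cleaned = clean_tweet(sample)
tokens = cleaned.split()  # simple whitespace tokenization (assumption)
print(cleaned)
print(tokens)
```

Running the steps in this order matters: hashtags and links are removed before punctuation, otherwise the `#` and `://` markers the pattern relies on would already be gone.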

In [None]:
%%capture
!pip install numpy # install numpy

In [None]:
%%capture
!pip install pandas # install pandas

In [None]:
%%capture
!pip install nltk # install nltk

In [None]:
%%capture
!pip install scikit-learn # install scikit-learn

## **Import the Libraries Used**



In [None]:
import numpy as np  # import numpy
# Library for working with data in DataFrames
import pandas as pd

# Libraries for preprocessing
import nltk
nltk.download('punkt')
from nltk.corpus import stopwords  # stopwords
from nltk import word_tokenize  # tokenizing

# For building count vectors and TF-IDF
from sklearn.feature_extraction.text import TfidfTransformer, TfidfVectorizer, CountVectorizer

# For running SVD
from sklearn.decomposition import TruncatedSVD

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
Tweets_df = pd.read_csv("https://raw.githubusercontent.com/maulidhan190081/file/main/prabowo2.csv")  # load the labelled data from the URL

Tweets_df  # display the data

Unnamed: 0,tweet,label
0,Doa para kiai sepuh untuk Cak Imin dan Prabowo...,pro
1,Menteri Pertahanan (Menhan) Prabowo Subianto m...,pro
2,Semangat Juang #Ir_soekarno Ada Pada diri #pra...,pro
3,Negarawan sejati yang selalu ingin membawa ked...,pro
4,"""Zulfan Lindan bicara soal kepentingan negara ...",kontra
...,...,...
96,Masyarakat antusias menyambut kedatangan mente...,pro
97,"Anies Siap Maju di Pilpres 2024, Wagub Tak aka...",pro
98,Adian Napitupulu Sebut Presiden Jokowi Tidak A...,kontra
99,Elit Gerindra Desak Sandiaga Uno Mundur dari P...,kontra


## **Lowercasing All Letters**

In [None]:
Tweets_df['tweet'] = Tweets_df['tweet'].str.replace(',', '')  # remove commas
Tweets_df['tweet'] = Tweets_df['tweet'].str.lower()  # convert to lowercase
Tweets_df['tweet']  # display the tweet column

0      doa para kiai sepuh untuk cak imin dan prabowo...
1      menteri pertahanan (menhan) prabowo subianto m...
2      semangat juang #ir_soekarno ada pada diri #pra...
3      negarawan sejati yang selalu ingin membawa ked...
4      "zulfan lindan bicara soal kepentingan negara ...
                             ...                        
96     masyarakat antusias menyambut kedatangan mente...
97     anies siap maju di pilpres 2024 wagub tak akan...
98     adian napitupulu sebut presiden jokowi tidak a...
99     elit gerindra desak sandiaga uno mundur dari p...
100    buah duku buah kedondong 2024 gue prabowo dong...
Name: tweet, Length: 101, dtype: object

In [None]:
# For removing numbers
import re

# For removing punctuation
import string

## **Removing Special Characters**

In [None]:
def remove_PTA_special(text):
    # remove escaped tab, newline, and unicode markers, then any remaining backslashes
    text = text.replace('\\t', " ").replace('\\n', " ").replace('\\u', " ").replace('\\', "")
    # replace non-ASCII characters (emoticons, Chinese characters, etc.) with '?'
    text = text.encode('ascii', 'replace').decode('ascii')
    # remove mentions, hashtags, and links
    text = ' '.join(re.sub(r"([@#][A-Za-z0-9]+)|(\w+:\/\/\S+)", " ", text).split())
    # remove leftover URL schemes from incomplete URLs
    return text.replace("http://", " ").replace("https://", " ")

Tweets_df['tweet'] = Tweets_df['tweet'].apply(remove_PTA_special)

## **Removing Numbers**

In [None]:
#remove number
def remove_number(text):
    return  re.sub(r"\d+", "", text)

Tweets_df['tweet'] = Tweets_df['tweet'].apply(remove_number)

## **Removing Punctuation**

In [None]:
#remove punctuation
def remove_punctuation(text):
    return text.translate(str.maketrans("","",string.punctuation))

Tweets_df['tweet'] = Tweets_df['tweet'].apply(remove_punctuation)

## **Removing Leading and Trailing Whitespace**

In [None]:
#remove whitespace leading & trailing
def remove_whitespace_LT(text):
    return text.strip()

Tweets_df['tweet'] = Tweets_df['tweet'].apply(remove_whitespace_LT)

## **Collapsing Multiple Spaces into One**

In [None]:
#remove multiple whitespace into single whitespace
def remove_whitespace_multiple(text):
    return re.sub(r'\s+', ' ', text)

Tweets_df['tweet'] = Tweets_df['tweet'].apply(remove_whitespace_multiple)

## **Removing Single Characters**

In [None]:
# remove single char
def remove_singl_char(text):
    return re.sub(r"\b[a-zA-Z]\b", "", text)

Tweets_df['tweet'] = Tweets_df['tweet'].apply(remove_singl_char)

## **Word Tokenize**

In [None]:
# NLTK word tokenize 
def word_tokenize_wrapper(text):
    return word_tokenize(text)

Tweets_df['tweet'] = Tweets_df['tweet'].apply(word_tokenize_wrapper)
Tweets_df['tweet']

0      [doa, para, kiai, sepuh, untuk, cak, imin, dan...
1      [menteri, pertahanan, menhan, prabowo, subiant...
2           [semangat, juang, soekarno, ada, pada, diri]
3      [negarawan, sejati, yang, selalu, ingin, memba...
4      [zulfan, lindan, bicara, soal, kepentingan, ne...
                             ...                        
96     [masyarakat, antusias, menyambut, kedatangan, ...
97     [anies, siap, maju, di, pilpres, wagub, tak, a...
98     [adian, napitupulu, sebut, presiden, jokowi, t...
99     [elit, gerindra, desak, sandiaga, uno, mundur,...
100    [buah, duku, buah, kedondong, gue, prabowo, do...
Name: tweet, Length: 101, dtype: object

## **Stopwords**

In [None]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
# Get the Indonesian stopword list
list_stopwords = stopwords.words('indonesian')

# Remove stopwords from the token list
def stopwords_removal(words):
    return [word for word in words if word not in list_stopwords]
Tweets_df['tweet'] = Tweets_df['tweet'].apply(stopwords_removal)

Tweets_df['tweet']

0      [doa, kiai, sepuh, cak, imin, prabowo, amin, y...
1      [menteri, pertahanan, menhan, prabowo, subiant...
2                            [semangat, juang, soekarno]
3                [negarawan, sejati, membawa, kedamaian]
4      [zulfan, lindan, bicara, kepentingan, negara, ...
                             ...                        
96     [masyarakat, antusias, menyambut, kedatangan, ...
97     [anies, maju, pilpres, wagub, dukung, riza, pa...
98     [adian, napitupulu, presiden, jokowi, maju, pe...
99     [elit, gerindra, desak, sandiaga, uno, mundur,...
100    [buah, duku, buah, kedondong, gue, prabowo, ge...
Name: tweet, Length: 101, dtype: object

In [None]:
Tweets_df.to_csv('TextPreprocessing.csv')

In [None]:
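dataTextPre = pd.read_csv('TextPreprocessing.csv')  # each token list is read back as a single string
vectorizer = CountVectorizer(min_df=1)  # keep every term that appears in at least one tweet
bag = vectorizer.fit_transform(dataTextPre['tweet'])  # sparse term-count matrix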

# **TF**

In [None]:
print(vectorizer.vocabulary_)

{'doa': 124, 'kiai': 263, 'sepuh': 516, 'cak': 75, 'imin': 198, 'prabowo': 459, 'amin': 10, 'ya': 597, 'allah': 9, 'rabb': 473, 'alamin': 8, 'menteri': 346, 'pertahanan': 438, 'menhan': 339, 'subianto': 537, 'pujiannya': 471, 'presiden': 464, 'joko': 213, 'widodo': 594, 'jokowi': 214, 'mengakui': 327, 'kepemimpinan': 252, 'kenegarawanan': 249, 'semangat': 509, 'juang': 215, 'soekarno': 529, 'negarawan': 374, 'sejati': 502, 'membawa': 309, 'kedamaian': 233, 'zulfan': 603, 'lindan': 289, 'bicara': 65, 'kepentingan': 253, 'negara': 373, 'pilpres': 447, 'indonesia': 200, 'menurutnya': 347, 'capres': 78, 'figurnya': 153, 'menguntungkan': 336, 'as': 24, 'china': 85, 'rusia': 492, 'panglima': 391, 'tni': 564, 'ksad': 276, 'kompak': 271, 'salam': 497, 'komando': 268, 'muhaimin': 365, 'iskandar': 205, 'dekati': 97, 'puan': 468, 'ingatkan': 203, 'piagam': 445, 'koalisi': 265, 'id': 194, 'fakta': 150, 'berita': 55, 'terbaru': 549, 'yan': 598, 'permenas': 432, 'mandenas': 298, 'menyebut': 349, 'ra

In [None]:
matrik_vsm=bag.toarray()
#print(matrik_vsm)
matrik_vsm.shape

(101, 605)
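To make the shape above easier to read (101 tweets by 605 vocabulary terms), here is a small sketch of how a term-frequency matrix of this kind can be built by hand. The three short documents are made up from words in the vocabulary, and the code is a simplified stand-in: it splits on whitespace, whereas `CountVectorizer` uses its own tokenizer.

```python
from collections import Counter

# Made-up mini corpus, one string per "tweet"
documents = ["doa kiai sepuh", "semangat juang soekarno", "doa doa amin"]

# Vocabulary in alphabetical order, so each term gets a fixed column
vocabulary = sorted({word for doc in documents for word in doc.split()})
column_of = {word: i for i, word in enumerate(vocabulary)}

# One row per document, one column per term, each cell is a raw count
matrix = []
for doc in documents:
    counts = Counter(doc.split())
    matrix.append([counts.get(word, 0) for word in vocabulary])

print(column_of)
for row in matrix:
    print(row)
```

Each row plays the role of one `matrik_vsm[i]` vector: mostly zeros, with a count wherever the tweet contains that term.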

In [None]:
matrik_vsm[0]

array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [None]:
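# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it (available since 1.0)
a = vectorizer.get_feature_names_out()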



In [None]:
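print(len(matrik_vsm[:,1]))  # number of tweets (rows)
#dfb =pd.DataFrame(data=matrik_vsm,index=df,columns=[a])
# Rows are numbered from 1; columns=[a] yields one-level tuple labels such as ('abdurachman',)
dataTF = pd.DataFrame(data=matrik_vsm, index=list(range(1, len(matrik_vsm[:,1])+1)), columns=[a])
dataTF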

101


Unnamed: 0,abdurachman,adian,airlangga,airlanggaprabowo,aja,ajak,akhlak,al,alamin,allah,...,wilayah,wonderwoman,ya,yan,yatim,yusuf,zon,zonmenilai,zulfan,zuperrr
1,0,0,0,0,0,0,0,0,1,1,...,0,0,2,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
97,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
98,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
99,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
100,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
datalabel = pd.read_csv('https://raw.githubusercontent.com/maulidhan190081/file/main/prabowo2.csv')
datatwett = pd.concat([dataTF.reset_index(), datalabel["label"]], axis=1)
datatwett

Unnamed: 0,"(index,)","(abdurachman,)","(adian,)","(airlangga,)","(airlanggaprabowo,)","(aja,)","(ajak,)","(akhlak,)","(al,)","(alamin,)",...,"(wonderwoman,)","(ya,)","(yan,)","(yatim,)","(yusuf,)","(zon,)","(zonmenilai,)","(zulfan,)","(zuperrr,)",label
0,1,0,0,0,0,0,0,0,0,1,...,0,2,0,0,0,0,0,0,0,pro
1,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,pro
2,3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,pro
3,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,pro
4,5,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,kontra
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
96,97,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,pro
97,98,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,pro
98,99,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,kontra
99,100,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,kontra


In [None]:
datatwett['label'].unique()

array(['pro', 'kontra'], dtype=object)

In [None]:
datatwett.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101 entries, 0 to 100
Columns: 607 entries, ('index',) to label
dtypes: int64(606), object(1)
memory usage: 479.1+ KB


In [None]:
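# Split into training and test sets (70/30) so the model is evaluated on tweets it has not seen
# Note: only 'label' is dropped, so the ('index',) column added by reset_index() stays in the feature matrix
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(datatwett.drop(labels=['label'], axis=1),
    datatwett['label'],
    test_size=0.3,
    random_state=0)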

In [None]:
X_train

Unnamed: 0,"(index,)","(abdurachman,)","(adian,)","(airlangga,)","(airlanggaprabowo,)","(aja,)","(ajak,)","(akhlak,)","(al,)","(alamin,)",...,"(wilayah,)","(wonderwoman,)","(ya,)","(yan,)","(yatim,)","(yusuf,)","(zon,)","(zonmenilai,)","(zulfan,)","(zuperrr,)"
80,81,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
91,92,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
68,69,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
51,52,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
27,28,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
97,98,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
67,68,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
64,65,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
47,48,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


In [None]:
from sklearn.feature_selection import mutual_info_classif
# determine the mutual information
mutual_info = mutual_info_classif(X_train, y_train)
mutual_info

array([1.14413737e-01, 0.00000000e+00, 0.00000000e+00, 1.23082828e-02,
       0.00000000e+00, 5.17053284e-02, 0.00000000e+00, 2.89646505e-02,
       2.77550384e-02, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       1.54310552e-02, 0.00000000e+00, 7.67137781e-02, 0.00000000e+00,
       8.42825412e-03, 6.34281971e-03, 0.00000000e+00, 0.00000000e+00,
       3.15602412e-02, 0.00000000e+00, 6.06060168e-02, 0.00000000e+00,
       3.87357810e-02, 5.15079860e-02, 0.00000000e+00, 0.00000000e+00,
       5.08807421e-02, 9.06756346e-03, 0.00000000e+00, 0.00000000e+00,
       8.17317958e-02, 3.63976880e-03, 3.46595112e-02, 0.00000000e+00,
       2.01916098e-02, 2.53332604e-02, 3.80613021e-03, 1.24744466e-01,
       2.87333062e-02, 2.87252130e-02, 8.69185337e-02, 1.00705382e-01,
       0.00000000e+00, 0.00000000e+00, 7.98530793e-02, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 7.39588347e-02, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
      

In [None]:
mutual_info = pd.Series(mutual_info)
mutual_info.index = X_train.columns
mutual_info.sort_values(ascending=False)

(imin,)              0.211416
(emas,)              0.209840
(kementeriannya,)    0.200678
(pertiwi,)           0.190180
(joko,)              0.172136
                       ...   
(koalisi,)           0.000000
(klik,)              0.000000
(ketua,)             0.000000
(keselamatan,)       0.000000
(zuperrr,)           0.000000
Length: 606, dtype: float64
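For intuition about what the scores above measure, the sketch below computes mutual information between one discrete feature and the labels using the textbook formula, the sum over p(x, y) · log(p(x, y) / (p(x) · p(y))). The helper name `mutual_information` and the toy data are assumptions for illustration; scikit-learn's `mutual_info_classif` may use a different estimator, so its numbers will not necessarily match this formula.

```python
import math
from collections import Counter

def mutual_information(feature, labels):
    """Plug-in mutual information (in nats) between two discrete sequences."""
    n = len(feature)
    joint = Counter(zip(feature, labels))   # counts of each (x, y) pair
    count_x = Counter(feature)              # counts of each feature value
    count_y = Counter(labels)               # counts of each label
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log(p_xy / ((count_x[x] / n) * (count_y[y] / n)))
    return mi

toy_labels = ["pro", "pro", "kontra", "kontra"]
informative = mutual_information([1, 1, 0, 0], toy_labels)    # term appears only in "pro" tweets
uninformative = mutual_information([1, 0, 1, 0], toy_labels)  # term spread evenly across labels
print(informative, uninformative)
```

A term that appears only in one class carries the maximum information about the label, while a term spread evenly across both classes scores zero, which is the pattern visible in the sorted list above.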

# **KNN**

In [2]:
from IPython.display import display, Math, Latex, HTML

The KNN algorithm is a classification algorithm that takes the K nearest data points (the neighbors) as the reference for deciding the class of a new data point. It classifies data based on its similarity, or closeness, to other data.

In general, the KNN algorithm works as follows.

1. Choose the number of neighbors (K) to consider when deciding the class.
2. Compute the distance from the new data point to every data point in the dataset.
3. Take the K data points with the smallest distances, then assign the new data point the majority class among them.
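The three steps above can be sketched in a few lines of plain Python. The function name `knn_predict` and the two-dimensional toy points are made up for illustration; this is a minimal sketch of the idea, not how scikit-learn's `KNeighborsClassifier` is implemented.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, new_point, k):
    """Classify new_point by majority vote among its k nearest training points."""
    # Step 2: Euclidean distance from the new point to every training point
    distances = [(math.dist(point, new_point), label)
                 for point, label in zip(train_points, train_labels)]
    # Step 3: keep the k closest points and take the most common label
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data: two well-separated groups
train_points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_labels = ["pro", "pro", "pro", "kontra", "kontra", "kontra"]

# Step 1: choose K
prediction = knn_predict(train_points, train_labels, (1, 1), k=3)
print(prediction)
```

Because the vote is taken over only K neighbors, the choice of K changes the result: a small K follows local noise, while a large K drifts toward the overall majority class.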

Euclidean distance formula, where x_{1i} and x_{2i} are the i-th feature values of the two data points and n is the number of features:

<img src="https://latex.codecogs.com/gif.latex?d\left&space;(&space;x_{1},x_{2}&space;\right&space;)=\sqrt{\sum_{i=1}^{n}\left&space;(&space;x_{1i}&space;-x_{2i}\right&space;)^{2}}" title="d\left ( x_{1},x_{2} \right )=\sqrt{\sum_{i=1}^{n}\left ( x_{1i} -x_{2i}\right )^{2}}" />





In [None]:
from sklearn.neighbors import KNeighborsClassifier

neigh = KNeighborsClassifier(n_neighbors=14)  # vote among the 14 nearest training tweets
neigh.fit(X_train, y_train)
Y_pred = neigh.predict(X_test)
Y_pred



array(['pro', 'pro', 'pro', 'kontra', 'pro', 'pro', 'pro', 'pro',
       'kontra', 'pro', 'pro', 'pro', 'pro', 'pro', 'pro', 'pro', 'pro',
       'pro', 'pro', 'pro', 'pro', 'pro', 'pro', 'pro', 'pro', 'pro',
       'pro', 'pro', 'pro', 'pro', 'pro'], dtype=object)

In [None]:
y_test

26     kontra
86        pro
2         pro
55     kontra
75        pro
94     kontra
16     kontra
73        pro
54     kontra
96        pro
53     kontra
93     kontra
78     kontra
13        pro
7      kontra
30        pro
22     kontra
24        pro
33     kontra
8         pro
43        pro
62        pro
3         pro
71        pro
45     kontra
48        pro
6      kontra
100       pro
82     kontra
76        pro
60        pro
Name: label, dtype: object

In [None]:
from sklearn.metrics import accuracy_score
Y_pred = neigh.predict(X_test)
accuracy_neigh = round(accuracy_score(y_test, Y_pred) * 100, 2)  # accuracy on the test set, in percent
acc_neigh = round(neigh.score(X_train, y_train) * 100, 2)  # accuracy on the training set, in percent
accuracy_neigh



61.29

In [None]:
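# `testing` is never defined in this notebook; the run count of 19 is inferred from the 19 values in the output.
# Each pass re-scores the same fitted model (n_neighbors=14), so every entry is identical.
n_runs = 19
arre = []
for i in range(n_runs):
    Y_pred = neigh.predict(X_test)
    accuracy_neigh = round(accuracy_score(y_test, Y_pred) * 100, 2)
    acc_neigh = round(neigh.score(X_train, y_train) * 100, 2)
    arre.append(accuracy_neigh)
arre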



[61.29,
 61.29,
 61.29,
 61.29,
 61.29,
 61.29,
 61.29,
 61.29,
 61.29,
 61.29,
 61.29,
 61.29,
 61.29,
 61.29,
 61.29,
 61.29,
 61.29,
 61.29,
 61.29]
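Every entry above is the same because the loop scores one fitted model repeatedly. If the goal was to compare different values of K, which is only my guess at the intent, a sweep could look like the sketch below. It runs on synthetic data from `make_classification` so that it is self-contained, and the `_demo` names are made up for illustration; the accuracies it prints say nothing about the tweet dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption): 100 samples, 20 features, 2 classes
X_demo, y_demo = make_classification(n_samples=100, n_features=20, random_state=0)
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0)

# Fit a fresh model for each K and record its test accuracy in percent
accuracy_by_k = {}
for k in range(1, 20):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train_demo, y_train_demo)
    accuracy_by_k[k] = round(model.score(X_test_demo, y_test_demo) * 100, 2)

best_k = max(accuracy_by_k, key=accuracy_by_k.get)
print(accuracy_by_k)
print("best K:", best_k)
```

The key difference from the loop above is that `n_neighbors` changes and the model is refitted on each pass, so the recorded accuracies can actually differ.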

# **KMeans**

K-Means clustering is a data analysis (data mining) method that performs unsupervised learning by grouping the data into a number of partitions (clusters).

K-means clustering is a basic algorithm that is applied as follows:

1. Choose the number of clusters
2. Choose the initial centroids
3. Compute the distance from each data point to each centroid
4. Assign each data point to the cluster of its nearest centroid
5. Recompute each cluster's centroid as the mean of its members
6. Repeat steps 3-5 until the cluster assignments no longer change
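The six steps above can be sketched with NumPy as follows. The function name `kmeans`, the random choice of initial centroids, and the toy points are assumptions for illustration; scikit-learn's `KMeans` uses a more careful initialization by default, so this is a simplified version of the idea rather than the library's algorithm.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Minimal k-means: returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct data points as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 3: Euclidean distance from every point to every centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        # Step 4: assign each point to its nearest centroid
        labels = distances.argmin(axis=1)
        # Step 5: move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 6: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy data: two well-separated groups of three points each
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centroids = kmeans(points, k=2)  # Step 1: choose the number of clusters
print(labels)
print(centroids)
```

The cluster numbers themselves (0 or 1) are arbitrary and depend on the initial centroids; only the grouping is meaningful.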

Euclidean distance formula for two points (x, y) and (a, b):

<img src="https://latex.codecogs.com/gif.latex?d\left&space;[&space;\left&space;(&space;x,y&space;\right&space;),\left&space;(&space;a,b&space;\right&space;)&space;\right&space;]=\sqrt{\left&space;(&space;x-a&space;\right&space;)^{2}&plus;\left&space;(&space;y-b&space;\right&space;)^{2}}" title="d\left [ \left ( x,y \right ),\left ( a,b \right ) \right ]=\sqrt{\left ( x-a \right )^{2}+\left ( y-b \right )^{2}}" />


In [None]:
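from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2)  # group the tweets into two clusters
kmeans = kmeans.fit(dataTF)  # fit on the term-frequency matrix
prediksi = kmeans.predict(dataTF)  # cluster assigned to each tweet
centroids = kmeans.cluster_centers_  # one centroid per cluster
centroids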



array([[-5.20417043e-18,  2.08333333e-02,  2.08333333e-02, ...,
        -5.20417043e-18,  2.08333333e-02,  2.08333333e-02],
       [ 1.88679245e-02, -3.46944695e-18,  1.50943396e-01, ...,
         1.88679245e-02, -3.46944695e-18, -3.46944695e-18]])

In [None]:
df= pd.DataFrame(prediksi)
df

Unnamed: 0,0
0,1
1,1
2,0
3,0
4,0
...,...
96,1
97,1
98,0
99,0


In [None]:
centroid = pd.DataFrame(centroids)
centroid

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,595,596,597,598,599,600,601,602,603,604
0,-5.2041700000000004e-18,0.02083333,0.020833,0.02083333,0.02083333,-5.2041700000000004e-18,0.02083333,-5.2041700000000004e-18,-5.2041700000000004e-18,-5.2041700000000004e-18,...,-5.2041700000000004e-18,0.02083333,0.020833,-5.2041700000000004e-18,0.02083333,-5.2041700000000004e-18,-5.2041700000000004e-18,-5.2041700000000004e-18,0.02083333,0.02083333
1,0.01886792,-3.469447e-18,0.150943,-3.469447e-18,-3.469447e-18,0.01886792,-3.469447e-18,0.01886792,0.01886792,0.01886792,...,0.01886792,-3.469447e-18,0.037736,0.01886792,-3.469447e-18,0.01886792,0.01886792,0.01886792,-3.469447e-18,-3.469447e-18
