# Scikit Learn 10: Text Processing: Bag of Words and Stop Word Filtering

Text processing lets a computer work with the words and sentences in a document. In machine learning, extracting features from text relies on natural language processing (NLP) techniques.

Bag of Words as a Text Representation

Bag of Words represents a text by the words it contains, ignoring grammar and word order. Every letter in the text is converted to lowercase and punctuation is discarded.
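The idea can be sketched without any library. This is a minimal illustration, not scikit-learn's implementation: the helper name `bag_of_words` is hypothetical, and the regular expression (tokens of two or more word characters) is chosen to mirror CountVectorizer's default token pattern.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, drop punctuation, and count the remaining tokens."""
    # A token is a run of two or more word characters, so punctuation
    # and single-character words are skipped.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return Counter(tokens)

bag = bag_of_words('Linux distributions include the Linux kernel.')
print(bag['linux'])   # 2
print(sorted(bag))    # ['distributions', 'include', 'kernel', 'linux', 'the']
```

Word order is lost: only the tokens and their counts remain.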

In [2]:
# The dataset for this Bag of Words example is a handful of short sentences (called a corpus)
corpus = [
    'Linux has been around since the mid-1990s.',
    'Linux distributions include the Linux kernel.',
    'Linux is one of the most prominent open-source software.'
]

corpus

['Linux has been around since the mid-1990s.',
 'Linux distributions include the Linux kernel.',
 'Linux is one of the most prominent open-source software.']

Applying Bag of Words for Feature Extraction with CountVectorizer

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()  # vectorizer instance that builds the bag of words
# fit_transform() learns the vocabulary and returns a sparse document-term matrix
# (one row per document, one column per token)
vectorized_X = vectorizer.fit_transform(corpus)
vectorized_X

<3x19 sparse matrix of type '<class 'numpy.int64'>'
	with 23 stored elements in Compressed Sparse Row format>
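The sparse summary above can be unpacked a little. The sketch below assumes the same three-sentence corpus and uses only the `shape` and `nnz` attributes and the `.toarray()` method of the SciPy sparse matrix; the variable names `X_sparse` and `X_dense` are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'Linux has been around since the mid-1990s.',
    'Linux distributions include the Linux kernel.',
    'Linux is one of the most prominent open-source software.'
]

X_sparse = CountVectorizer().fit_transform(corpus)
print(X_sparse.shape)  # (3, 19): 3 documents, 19 distinct tokens
print(X_sparse.nnz)    # 23: number of non-zero counts actually stored

# .toarray() gives a plain NumPy array, convenient for inspection on small corpora
X_dense = X_sparse.toarray()
print(X_dense.shape)   # (3, 19)
```

Only 23 of the 57 cells are non-zero, which is why scikit-learn keeps the matrix sparse by default.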

In [8]:
vectorizer.get_feature_names_out()
# Each word in the bag is called a token
# A token's position in this array (starting at 0) is its column index in the count matrix; the values in that column are the number of times the token occurs in each document

array(['1990s', 'around', 'been', 'distributions', 'has', 'include', 'is',
       'kernel', 'linux', 'mid', 'most', 'of', 'one', 'open', 'prominent',
       'since', 'software', 'source', 'the'], dtype=object)
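To connect a token to its counts, the fitted vectorizer's `vocabulary_` attribute maps each token to its column index. A small sketch, assuming the same corpus; the variable names `counts` and `col` are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'Linux has been around since the mid-1990s.',
    'Linux distributions include the Linux kernel.',
    'Linux is one of the most prominent open-source software.'
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus).toarray()

# vocabulary_ maps token -> column index in the count matrix
col = vectorizer.vocabulary_['linux']
print(col)             # 8
print(counts[:, col])  # [1 2 1]: 'linux' occurs twice in the second sentence
```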

In [5]:
# The bag-of-words representation makes it easier for a machine learning model or algorithm to measure how close or similar documents are.
from sklearn.metrics.pairwise import euclidean_distances

# scikit-learn rejects np.matrix (the result of .todense()), so build a plain NumPy array with .toarray()
X = vectorizer.transform(corpus).toarray()

for i in range(len(X)):
    for j in range(i + 1, len(X)):
        # Slicing with i:i+1 keeps each document as a 2D row of shape (1, n_features)
        jarak = euclidean_distances(X[i:i + 1], X[j:j + 1])
        print(f'Distance between documents {i+1} and {j+1}: {jarak}')

Distance between documents 1 and 2: [[3.16227766]]
Distance between documents 1 and 3: [[3.74165739]]
Distance between documents 2 and 3: [[3.46410162]]
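The pairwise loop can also be collapsed into a single call, since `euclidean_distances` accepts the whole document-term array and returns the full distance matrix. A minimal sketch, assuming the same three-sentence corpus; `.toarray()` is used because scikit-learn accepts plain NumPy arrays but not `np.matrix`.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances

corpus = [
    'Linux has been around since the mid-1990s.',
    'Linux distributions include the Linux kernel.',
    'Linux is one of the most prominent open-source software.'
]

X = CountVectorizer().fit_transform(corpus).toarray()

# Entry [i, j] is the Euclidean distance between document i and document j
distances = euclidean_distances(X)
print(np.round(distances, 3))
```

The matrix is symmetric with zeros on the diagonal, so the upper triangle holds the three document pairs.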

Stop Word Filtering on Text

Stop word filtering simplifies the text representation by ignoring very common words such as determiners (the, a, an), auxiliary verbs (do, be, will), and prepositions (on, in, at).
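As a rough illustration of the filtering step, the sketch below removes stop words from one sentence by hand. It assumes scikit-learn's built-in `ENGLISH_STOP_WORDS` set, which is the list behind `stop_words='english'`; the tokenizing regular expression is a simplification chosen to mirror CountVectorizer's default.

```python
import re
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

sentence = 'Linux has been around since the mid-1990s.'

# Lowercase and split into tokens of two or more word characters
tokens = re.findall(r"\b\w\w+\b", sentence.lower())

# Keep only the tokens that are not in the stop word list
kept = [t for t in tokens if t not in ENGLISH_STOP_WORDS]
print(kept)  # ['linux', 'mid', '1990s']
```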

In [9]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(stop_words='english')
# fit_transform() learns the vocabulary and builds the document-term matrix; .todense() converts the sparse result to a dense 2D matrix for display
vectorized_X = vectorizer.fit_transform(corpus).todense()
vectorized_X

matrix([[1, 0, 0, 0, 1, 1, 0, 0, 0, 0],
        [0, 1, 1, 1, 2, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]], dtype=int64)

In [10]:
vectorizer.get_feature_names_out()

array(['1990s', 'distributions', 'include', 'kernel', 'linux', 'mid',
       'open', 'prominent', 'software', 'source'], dtype=object)

# Scikit Learn 11: TF-IDF (Term Frequency - Inverse Document Frequency)
    
TF-IDF computes the weight of a word with respect to one document within a collection of documents. It is a statistical measure of how important a word is to a document in a corpus, and it is widely used in information retrieval.

For Term Frequency, the simplest formula is the raw count: the number of times a term (word) occurs in the document.
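A raw count can be computed with the standard library alone. This is a minimal sketch using one sentence from this section's corpus and plain whitespace splitting, which is simpler than the tokenization scikit-learn applies.

```python
from collections import Counter

document = 'the cat saw the mouse'

# Raw-count term frequency: how many times each term occurs in the document
term_frequency = Counter(document.split())
print(term_frequency['the'])  # 2
print(term_frequency['cat'])  # 1
```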

In [2]:
# text dataset (corpus)
corpus = [
    'the house had a tiny little mouse', 
    'the cat saw the mouse',
    'the mouse ran away from the house', 
    'the cat finally ate the mouse',
    'the end of the mouse story'
]

corpus

['the house had a tiny little mouse',
 'the cat saw the mouse',
 'the mouse ran away from the house',
 'the cat finally ate the mouse',
 'the end of the mouse story']

TF-IDF Weights with TfidfVectorizer

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english')
response = vectorizer.fit_transform(corpus)
print(response)
# output format: (index of the sentence in the dataset, token index)    TF-IDF weight computed by TfidfVectorizer

  (0, 7)	0.2808823162882302
  (0, 6)	0.5894630806320427
  (0, 11)	0.5894630806320427
  (0, 5)	0.47557510189256375
  (1, 9)	0.7297183669435993
  (1, 2)	0.5887321837696324
  (1, 7)	0.3477147117091919
  (2, 1)	0.5894630806320427
  (2, 8)	0.5894630806320427
  (2, 7)	0.2808823162882302
  (2, 5)	0.47557510189256375
  (3, 0)	0.5894630806320427
  (3, 4)	0.5894630806320427
  (3, 2)	0.47557510189256375
  (3, 7)	0.2808823162882302
  (4, 10)	0.6700917930430479
  (4, 3)	0.6700917930430479
  (4, 7)	0.3193023297639811


In [4]:
# inspect the tokens
vectorizer.get_feature_names_out()

array(['ate', 'away', 'cat', 'end', 'finally', 'house', 'little', 'mouse',
       'ran', 'saw', 'story', 'tiny'], dtype=object)

In [5]:
# dense 2D matrix: one row per sentence, one column per token
response.todense()

matrix([[0.        , 0.        , 0.        , 0.        , 0.        ,
         0.4755751 , 0.58946308, 0.28088232, 0.        , 0.        ,
         0.        , 0.58946308],
        [0.        , 0.        , 0.58873218, 0.        , 0.        ,
         0.        , 0.        , 0.34771471, 0.        , 0.72971837,
         0.        , 0.        ],
        [0.        , 0.58946308, 0.        , 0.        , 0.        ,
         0.4755751 , 0.        , 0.28088232, 0.58946308, 0.        ,
         0.        , 0.        ],
        [0.58946308, 0.        , 0.4755751 , 0.        , 0.58946308,
         0.        , 0.        , 0.28088232, 0.        , 0.        ,
         0.        , 0.        ],
        [0.        , 0.        , 0.        , 0.67009179, 0.        ,
         0.        , 0.        , 0.31930233, 0.        , 0.        ,
         0.67009179, 0.        ]])
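The first row of the matrix above can be reproduced by hand. The sketch below assumes TfidfVectorizer's default settings: raw-count term frequency, the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each document vector. The names `doc_tokens` and `doc_freq` are illustrative, and their values are read off the corpus after stop word removal.

```python
import math

n_docs = 5
# Tokens of the first sentence that survive English stop word filtering
doc_tokens = ['house', 'tiny', 'little', 'mouse']
# Number of documents (out of 5) that contain each token
doc_freq = {'house': 2, 'tiny': 1, 'little': 1, 'mouse': 5}

# Smoothed IDF, as assumed above
idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

# Unnormalized weight = raw count in the document * IDF
raw = {t: doc_tokens.count(t) * idf[t] for t in doc_freq}

# L2-normalize so the document vector has unit length
norm = math.sqrt(sum(v * v for v in raw.values()))
weights = {t: v / norm for t, v in raw.items()}

for token, weight in sorted(weights.items()):
    print(f'{token:<8} {weight:.6f}')
```

Because "mouse" appears in every document, its IDF is the minimum value of 1, so it receives the lowest weight in the row.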

In [7]:
# convert to a pandas DataFrame
import pandas as pd

df = pd.DataFrame(response.todense().T,  # .T transposes the matrix so rows are tokens and columns are sentences (documents)
                 index=vectorizer.get_feature_names_out(),
                 columns=[f'D{i+1}' for i in range(len(corpus))])

df

token,D1,D2,D3,D4,D5
ate,0.0,0.0,0.0,0.589463,0.0
away,0.0,0.0,0.589463,0.0,0.0
cat,0.0,0.588732,0.0,0.475575,0.0
end,0.0,0.0,0.0,0.0,0.670092
finally,0.0,0.0,0.0,0.589463,0.0
house,0.475575,0.0,0.475575,0.0,0.0
little,0.589463,0.0,0.0,0.0,0.0
mouse,0.280882,0.347715,0.280882,0.280882,0.319302
ran,0.0,0.0,0.589463,0.0,0.0
saw,0.0,0.729718,0.0,0.0,0.0
