# Introduction to Text Processing: Bag of Words & Stop Word Filtering

## Bag of Words model as a text representation

### Dataset

In [7]:
corpus = [
    'Linux has been around since the mid-1990s.',
    'Linux distributions include the linux kernel.',
    'Linux is one of the most prominent open-source software'
]

corpus

['Linux has been around since the mid-1990s.',
 'Linux distributions include the linux kernel.',
 'Linux is one of the most prominent open-source software']

### Bag of Words model with CountVectorizer

In [8]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorized_X = vectorizer.fit_transform(corpus).toarray()
vectorized_X

array([[1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1],
       [0, 0, 0, 1, 0, 1, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1]])

In [9]:
vectorizer.get_feature_names_out()

array(['1990s', 'around', 'been', 'distributions', 'has', 'include', 'is',
       'kernel', 'linux', 'mid', 'most', 'of', 'one', 'open', 'prominent',
       'since', 'software', 'source', 'the'], dtype=object)
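
To make the matrix above less opaque, here is a minimal sketch that rebuilds the same counts with the standard library only. It assumes CountVectorizer's default behaviour (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary); the helper names `tokenize` and `to_vector` are hypothetical and not part of scikit-learn.

```python
import re
from collections import Counter

corpus = [
    'Linux has been around since the mid-1990s.',
    'Linux distributions include the linux kernel.',
    'Linux is one of the most prominent open-source software',
]

# Assumption: mimic CountVectorizer's defaults, i.e. lowercase the text and
# keep only tokens made of two or more word characters.
TOKEN_PATTERN = re.compile(r'\b\w\w+\b')


def tokenize(text):
    """Split a document into lowercase tokens."""
    return TOKEN_PATTERN.findall(text.lower())


# Vocabulary: every distinct token in the corpus, sorted alphabetically.
vocabulary = sorted({token for doc in corpus for token in tokenize(doc)})


def to_vector(text):
    """Count how often each vocabulary term occurs in one document."""
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]


bow = [to_vector(doc) for doc in corpus]

for row in bow:
    print(row)
```

Each column corresponds to one vocabulary term and each row to one document, which is why the second row holds a 2 in the `linux` column: that word appears twice in the second sentence.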

## Euclidean Distance for measuring closeness/distance between documents (vectors)

In [10]:
from sklearn.metrics.pairwise import euclidean_distances

for i in range(len(vectorized_X)):
    # Start at i + 1 so each pair of documents is compared exactly once.
    for j in range(i + 1, len(vectorized_X)):
        # euclidean_distances expects 2D input, so reshape each row to (1, n_features).
        data_i = vectorized_X[i].reshape(1, -1)
        data_j = vectorized_X[j].reshape(1, -1)
        distance = euclidean_distances(data_i, data_j)
        print(f'Distance between document {i+1} and {j+1}: {distance[0][0]:.4f}')

Distance between document 1 and 2: 3.1623
Distance between document 1 and 3: 3.7417
Distance between document 2 and 3: 3.4641
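
As a sanity check, the first distance can be reproduced by hand. The sketch below hard-codes the first two rows of the count matrix shown earlier and applies the Euclidean formula directly (square root of the sum of squared differences); `euclidean_distance` is a hypothetical helper, not the scikit-learn function.

```python
import math

# First two rows of the Bag of Words matrix shown earlier.
doc_1 = [1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1]
doc_2 = [0, 0, 0, 1, 0, 1, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]


def euclidean_distance(u, v):
    """Square root of the sum of squared element-wise differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# The two documents differ by exactly 1 in ten columns, so the sum is 10.
squared_diff_sum = sum((a - b) ** 2 for a, b in zip(doc_1, doc_2))
print(squared_diff_sum)                            # 10
print(f'{euclidean_distance(doc_1, doc_2):.4f}')   # 3.1623
```

A smaller distance means the two count vectors, and therefore the word usage of the two documents, are more alike.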


## Stop Word Filtering on text

Stop word filtering simplifies the text representation by ignoring certain words, such as determiners (the, a, an), auxiliary verbs (do, be, will), and prepositions (on, in, at).
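
The idea can be sketched in a few lines of plain Python. The stop list here is hypothetical and deliberately tiny, built only from the example words above; real stop lists are considerably longer, and `remove_stop_words` is an illustrative helper rather than a library function.

```python
import re

# Hypothetical, deliberately tiny stop list taken from the examples above.
STOP_WORDS = {'the', 'a', 'an', 'do', 'be', 'will', 'on', 'in', 'at'}


def remove_stop_words(text):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r'\w+', text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


filtered = remove_stop_words('Linux distributions include the linux kernel.')
print(filtered)
# ['linux', 'distributions', 'include', 'linux', 'kernel']
```

Only `the` is removed from this sentence; the remaining tokens are the ones that carry the meaning of the document.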

In [11]:
corpus

['Linux has been around since the mid-1990s.',
 'Linux distributions include the linux kernel.',
 'Linux is one of the most prominent open-source software']

### Stop Word Filtering with CountVectorizer

In [14]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(stop_words='english')
vectorized_X = vectorizer.fit_transform(corpus).toarray()
vectorized_X

array([[1, 0, 0, 0, 1, 1, 0, 0, 0, 0],
       [0, 1, 1, 1, 2, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]])

In [16]:
vectorizer.get_feature_names_out()

array(['1990s', 'distributions', 'include', 'kernel', 'linux', 'mid',
       'open', 'prominent', 'software', 'source'], dtype=object)
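
To see exactly which words the filter dropped, the two vocabularies printed by `get_feature_names_out()` can be compared with a set difference. This is a small sketch that hard-codes both lists as they appear in the outputs above; the variable names are illustrative.

```python
# Vocabulary without stop word filtering (19 terms).
all_features = [
    '1990s', 'around', 'been', 'distributions', 'has', 'include', 'is',
    'kernel', 'linux', 'mid', 'most', 'of', 'one', 'open', 'prominent',
    'since', 'software', 'source', 'the',
]

# Vocabulary with stop_words='english' (10 terms).
filtered_features = [
    '1990s', 'distributions', 'include', 'kernel', 'linux', 'mid',
    'open', 'prominent', 'software', 'source',
]

# Terms present before filtering but absent afterwards.
removed = sorted(set(all_features) - set(filtered_features))
print(removed)
# ['around', 'been', 'has', 'is', 'most', 'of', 'one', 'since', 'the']
```

Nine of the nineteen original terms are removed, which is why the filtered matrix has ten columns instead of nineteen.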