# Introduction to Text Processing: Bag of Words & Stop Word Filtering

## Bag of Words model as a text representation
The Bag of Words model simplifies a text by representing it as a collection of words, ignoring grammar and the position of each word in the sentence. The text is converted to lowercase and punctuation is ignored.

### Dataset

In [4]:
corpus = [
    'linux has been around since mid 1990-s',
    'Linux distributions include the Linux-kernel',
    'Linux is one of the most prominent open-source software'
]
corpus

['linux has been around since mid 1990-s',
 'Linux distributions include the Linux-kernel',
 'Linux is one of the most prominent open-source software']

## Bag of Words model with CountVectorizer
- The `todense()` method converts the sparse matrix returned by `fit_transform()` into a dense matrix, so every count is visible.
- Each column of the matrix corresponds to one entry of the feature names, and each value is the number of times that token occurs in the sentence.

In [5]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
vectorized_X = vectorizer.fit_transform(corpus).todense()
vectorized_X

matrix([[1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0],
        [0, 0, 0, 1, 0, 1, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
        [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1]],
       dtype=int64)

In [6]:
vectorizer.get_feature_names()

['1990',
 'around',
 'been',
 'distributions',
 'has',
 'include',
 'is',
 'kernel',
 'linux',
 'mid',
 'most',
 'of',
 'one',
 'open',
 'prominent',
 'since',
 'software',
 'source',
 'the']
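
To make the counting step concrete, the matrix and feature names above can be reproduced by hand. This is a minimal sketch using only the standard library, not the actual scikit-learn implementation. It assumes CountVectorizer's documented default token pattern (runs of two or more word characters), which is why the lone `s` in `1990-s` is dropped. The names `tokenize`, `count_vector` and `bow_matrix` are introduced here for illustration only.

```python
import re
from collections import Counter

corpus = [
    'linux has been around since mid 1990-s',
    'Linux distributions include the Linux-kernel',
    'Linux is one of the most prominent open-source software',
]

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters
    # (assumed to mirror CountVectorizer's default token_pattern).
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(doc) for doc in corpus]

# The vocabulary is the sorted set of all tokens in the corpus.
vocabulary = sorted({token for tokens in tokenized for token in tokens})

def count_vector(tokens):
    # One count per vocabulary entry; absent tokens count as 0.
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

bow_matrix = [count_vector(tokens) for tokens in tokenized]

print(vocabulary)
for row in bow_matrix:
    print(row)
```

Note that `linux` gets a count of 2 in the second document, because `Linux` and `Linux-kernel` both contribute a `linux` token after lowercasing.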

## Euclidean Distance for measuring the closeness / distance between documents (vectors)

In [7]:
from sklearn.metrics.pairwise import euclidean_distances

# Compare every pair of documents exactly once (j always starts after i)
for i in range(len(vectorized_X)):
    for j in range(i + 1, len(vectorized_X)):
        # jarak: Euclidean distance between document i and document j
        jarak = euclidean_distances(vectorized_X[i], vectorized_X[j])
        print(f'Distance between document {i+1} and {j+1} is {jarak}')

Distance between document 1 and 2 is [[3.31662479]]
Distance between document 1 and 3 is [[3.87298335]]
Distance between document 2 and 3 is [[3.46410162]]
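
The Euclidean distance between two count vectors u and v is the square root of the sum of the squared differences of their components. As a rough check, the three distances printed above can be recomputed with the standard library alone. The vectors are copied from the CountVectorizer output earlier in this notebook; `euclidean` and `distances` are illustrative names, not part of scikit-learn.

```python
import math

# Count vectors copied from the CountVectorizer output shown earlier.
doc_vectors = [
    [1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 1, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1],
]

def euclidean(u, v):
    # Square root of the sum of squared component-wise differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Store the distance of every unordered pair, keyed by 1-based document numbers.
distances = {}
for i in range(len(doc_vectors)):
    for j in range(i + 1, len(doc_vectors)):
        distances[(i + 1, j + 1)] = euclidean(doc_vectors[i], doc_vectors[j])

for (a, b), d in distances.items():
    print(f'Documents {a} and {b}: {d:.8f}')
```

Documents 1 and 2 come out closest, which matches the smallest value in the scikit-learn output.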


## Stop Words in Text Filtering
Stop word filtering simplifies the text representation by ignoring very common words such as determiners (the, a, an), auxiliary verbs (do, be, will) and prepositions (on, in, at).

### Dataset

In [8]:
corpus

['linux has been around since mid 1990-s',
 'Linux distributions include the Linux-kernel',
 'Linux is one of the most prominent open-source software']

### Stop Word Filtering with CountVectorizer

In [9]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(stop_words='english')
vectorized_X = vectorizer.fit_transform(corpus).todense()
vectorized_X

matrix([[1, 0, 0, 0, 1, 1, 0, 0, 0, 0],
        [0, 1, 1, 1, 2, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]], dtype=int64)

In [10]:
vectorizer.get_feature_names()

['1990',
 'distributions',
 'include',
 'kernel',
 'linux',
 'mid',
 'open',
 'prominent',
 'software',
 'source']
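
The effect of the filter can be sketched without scikit-learn: tokenize as before, then drop every token found in a stop list. The `STOP_WORDS` set below is a hypothetical, hand-picked list that only covers the words removed from this particular corpus. The built-in `'english'` list in scikit-learn is much longer, so treat this as an illustration rather than a replacement.

```python
import re

corpus = [
    'linux has been around since mid 1990-s',
    'Linux distributions include the Linux-kernel',
    'Linux is one of the most prominent open-source software',
]

# Hypothetical stop list, just large enough for this corpus (an assumption,
# not scikit-learn's actual 'english' list).
STOP_WORDS = {'has', 'been', 'around', 'since', 'the', 'is', 'one', 'of', 'most'}

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

# Drop stop words from every document before building the vocabulary.
filtered = [[t for t in tokenize(doc) if t not in STOP_WORDS] for doc in corpus]
vocabulary = sorted({t for tokens in filtered for t in tokens})

print(vocabulary)
```

The vocabulary shrinks from 19 terms to 10, and the remaining terms are the ones that actually distinguish the documents.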

# Introduction to TF-IDF (Term Frequency - Inverse Document Frequency)
- TF-IDF measures how important a word is to a document relative to a collection of documents (the corpus).
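
Several TF-IDF variants exist. The formulas below are a sketch of the variant that `TfidfVectorizer` is documented to use with its default settings (`smooth_idf=True`, `norm='l2'`), which is an assumption about how the weights in this notebook were produced. Here $n$ is the number of documents, $\mathrm{df}(t)$ is the number of documents containing term $t$, and $\mathrm{tf}(t, d)$ is the raw count of $t$ in document $d$.

```latex
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1

w(t, d) = \frac{\mathrm{tf}(t, d)\,\mathrm{idf}(t)}
               {\sqrt{\sum_{t'} \bigl(\mathrm{tf}(t', d)\,\mathrm{idf}(t')\bigr)^{2}}}
```

A term that appears in every document gets $\mathrm{idf}(t) = 1$, the lowest possible value, while rarer terms get larger weights.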

### Dataset

In [11]:
corpus = [
    'the house had a tiny little mouse',
    'the cat saw the mouse',
    'the mouse ran away from the house',
    'the cat finally ate the  mouse',
    'the end of the mouse story'
]
corpus

['the house had a tiny little mouse',
 'the cat saw the mouse',
 'the mouse ran away from the house',
 'the cat finally ate the  mouse',
 'the end of the mouse story']

## TF-IDF Weights with TfidfVectorizer

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words = 'english')
response = vectorizer.fit_transform(corpus)
print(response)

  (0, 7)	0.2808823162882302
  (0, 6)	0.5894630806320427
  (0, 11)	0.5894630806320427
  (0, 5)	0.47557510189256375
  (1, 9)	0.7297183669435993
  (1, 2)	0.5887321837696324
  (1, 7)	0.3477147117091919
  (2, 1)	0.5894630806320427
  (2, 8)	0.5894630806320427
  (2, 7)	0.2808823162882302
  (2, 5)	0.47557510189256375
  (3, 0)	0.5894630806320427
  (3, 4)	0.5894630806320427
  (3, 2)	0.47557510189256375
  (3, 7)	0.2808823162882302
  (4, 10)	0.6700917930430479
  (4, 3)	0.6700917930430479
  (4, 7)	0.3193023297639811


In [13]:
vectorizer.get_feature_names()

['ate',
 'away',
 'cat',
 'end',
 'finally',
 'house',
 'little',
 'mouse',
 'ran',
 'saw',
 'story',
 'tiny']

In [14]:
response.todense()

matrix([[0.        , 0.        , 0.        , 0.        , 0.        ,
         0.4755751 , 0.58946308, 0.28088232, 0.        , 0.        ,
         0.        , 0.58946308],
        [0.        , 0.        , 0.58873218, 0.        , 0.        ,
         0.        , 0.        , 0.34771471, 0.        , 0.72971837,
         0.        , 0.        ],
        [0.        , 0.58946308, 0.        , 0.        , 0.        ,
         0.4755751 , 0.        , 0.28088232, 0.58946308, 0.        ,
         0.        , 0.        ],
        [0.58946308, 0.        , 0.4755751 , 0.        , 0.58946308,
         0.        , 0.        , 0.28088232, 0.        , 0.        ,
         0.        , 0.        ],
        [0.        , 0.        , 0.        , 0.67009179, 0.        ,
         0.        , 0.        , 0.31930233, 0.        , 0.        ,
         0.67009179, 0.        ]])
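
To see where these numbers come from, the weights can be recomputed by hand. This is a minimal sketch, assuming the smoothed IDF `ln((1 + n) / (1 + df)) + 1`, raw term counts and L2 normalization that `TfidfVectorizer` is documented to use by default. `STOP_WORDS` is a hypothetical list covering only the stop words in this corpus, and `doc_freq`, `idf` and `tfidf_row` are illustrative names.

```python
import math
from collections import Counter

corpus = [
    'the house had a tiny little mouse',
    'the cat saw the mouse',
    'the mouse ran away from the house',
    'the cat finally ate the  mouse',
    'the end of the mouse story',
]

# Hypothetical stop list covering only the stop words in this corpus.
STOP_WORDS = {'the', 'had', 'a', 'from', 'of'}

# Whitespace tokenization is enough here: no punctuation, all lowercase.
docs = [[t for t in doc.lower().split() if t not in STOP_WORDS] for doc in corpus]
vocabulary = sorted({t for tokens in docs for t in tokens})
n_docs = len(docs)

# Document frequency: in how many documents each term occurs.
doc_freq = Counter(t for tokens in docs for t in set(tokens))

# Smoothed inverse document frequency (assumed scikit-learn default).
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

def tfidf_row(tokens):
    # Raw count times idf, then scale the row to unit Euclidean length.
    tf = Counter(tokens)
    raw = [tf[t] * idf[t] for t in vocabulary]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw]

tfidf = [tfidf_row(tokens) for tokens in docs]

for i, row in enumerate(tfidf, start=1):
    weights = {t: round(w, 6) for t, w in zip(vocabulary, row) if w > 0}
    print(f'D{i}: {weights}')
```

Because `mouse` occurs in all five documents, it has the smallest weight in every row, whereas terms unique to one document such as `saw` or `end` receive the largest.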

In [18]:
import pandas as pd

# Transpose so that rows are terms and columns are documents D1..D5
df = pd.DataFrame(response.todense().T,
                  index=vectorizer.get_feature_names(),
                  columns=[f'D{i+1}' for i in range(len(corpus))])
df

| | D1 | D2 | D3 | D4 | D5 |
|---|---|---|---|---|---|
| ate | 0.0 | 0.0 | 0.0 | 0.589463 | 0.0 |
| away | 0.0 | 0.0 | 0.589463 | 0.0 | 0.0 |
| cat | 0.0 | 0.588732 | 0.0 | 0.475575 | 0.0 |
| end | 0.0 | 0.0 | 0.0 | 0.0 | 0.670092 |
| finally | 0.0 | 0.0 | 0.0 | 0.589463 | 0.0 |
| house | 0.475575 | 0.0 | 0.475575 | 0.0 | 0.0 |
| little | 0.589463 | 0.0 | 0.0 | 0.0 | 0.0 |
| mouse | 0.280882 | 0.347715 | 0.280882 | 0.280882 | 0.319302 |
| ran | 0.0 | 0.0 | 0.589463 | 0.0 | 0.0 |
| saw | 0.0 | 0.729718 | 0.0 | 0.0 | 0.0 |
