<a href="https://colab.research.google.com/github/ihyaulumuddin044/machineLearning/blob/main/Text_Processing_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Processing: Bag of Words and Stop Word Filtering

Bag of Words (BoW) is a technique for representing text numerically, used in Natural Language Processing (NLP) and machine learning. Each document becomes a vector of word counts over a shared vocabulary, and word order is discarded. The approach is simple but effective for turning text into numeric features that a machine learning model can consume.
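To make the idea concrete before turning to a library, here is a minimal sketch of Bag of Words in plain Python. The two-sentence `docs` list and the whitespace tokenization are assumptions made for illustration only; they are not part of the notebook's data.

```python
from collections import Counter

# Hypothetical two-sentence corpus, used only for illustration.
docs = ["the cat sat", "the cat saw the dog"]

# Naive whitespace tokenization (an assumption; real tokenizers are more careful).
tokenized = [doc.lower().split() for doc in docs]

# The vocabulary is the sorted set of all distinct tokens in the corpus.
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# Each document becomes a list of counts, one position per vocabulary word.
vectors = []
for tokens in tokenized:
    counts = Counter(tokens)
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
print(vectors)
```

Every document ends up with the same vector length (the vocabulary size), which is what lets documents be compared with ordinary vector operations.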

In [44]:
corpus = [
    'linux has been around since the mid-1990s.',
    'linux distributions include the linux kernel.',
    'linux is one of the most prominent open-source software.'
]

corpus


['linux has been around since the mid-1990s.',
 'linux distributions include the linux kernel.',
 'linux is one of the most prominent open-source software.']

# Bag of Words using CountVectorizer

In [45]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorized_x = vectorizer.fit_transform(corpus).toarray()
vectorized_x

array([[1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1],
       [0, 0, 0, 1, 0, 1, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1]])

In [46]:
vectorizer.get_feature_names_out()

array(['1990s', 'around', 'been', 'distributions', 'has', 'include', 'is',
       'kernel', 'linux', 'mid', 'most', 'of', 'one', 'open', 'prominent',
       'since', 'software', 'source', 'the'], dtype=object)
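The feature names list `'mid'` and `'1990s'` (and likewise `'open'` and `'source'`) as separate entries. This is consistent with `CountVectorizer` lowercasing the text and, by default, extracting tokens with the pattern `(?u)\b\w\w+\b`, which treats hyphens and other punctuation as separators. The sketch below reproduces only that tokenization step with the standard `re` module; the helper name `tokenize` is hypothetical, and this is not the library's internal code.

```python
import re

# Default token_pattern documented for CountVectorizer:
# runs of two or more word characters between word boundaries.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"


def tokenize(text):
    """Lowercase the text and extract tokens of two or more word characters."""
    return re.findall(TOKEN_PATTERN, text.lower())


tokens = tokenize('linux has been around since the mid-1990s.')
print(tokens)
```

Because the pattern requires at least two characters, single-character words such as "a" would be dropped as well.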

In [47]:
print("Shape of vectorized_x:", vectorized_x.shape)
print("Contents of vectorized_x:\n", vectorized_x)


Shape of vectorized_x: (3, 19)
Contents of vectorized_x:
 [[1 1 1 0 1 0 0 0 1 1 0 0 0 0 0 1 0 0 1]
 [0 0 0 1 0 1 0 1 2 0 0 0 0 0 0 0 0 0 1]
 [0 0 0 0 0 0 1 0 1 0 1 1 1 1 1 0 1 1 1]]


# Euclidean distance for measuring the distance between documents (vectors)

In [51]:
from sklearn.metrics.pairwise import euclidean_distances
print("Loop started")
for i in range(len(vectorized_x)):
    for j in range(i + 1, len(vectorized_x)):  # Start at i + 1 so each pair is computed only once
        print(f"Processing pair ({i+1}, {j+1})")
        # euclidean_distances expects 2D inputs, so each vector is wrapped in a list
        jarak = euclidean_distances([vectorized_x[i]], [vectorized_x[j]])
        print(f'Distance between documents {i+1} and {j+1}: {jarak[0][0]}')


Loop started
Processing pair (1, 2)
Distance between documents 1 and 2: 3.1622776601683795
Processing pair (1, 3)
Distance between documents 1 and 3: 3.7416573867739413
Processing pair (2, 3)
Distance between documents 2 and 3: 3.4641016151377544
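The Euclidean distance between two count vectors is the square root of the sum of squared element-wise differences. As a sanity check on the first value printed above, the sketch below recomputes the distance between documents 1 and 2 with the standard library only. The two vectors are copied from the `vectorized_x` output; the helper name `euclidean` is hypothetical.

```python
import math

# Rows 1 and 2 of vectorized_x, copied from the output above.
doc_1 = [1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1]
doc_2 = [0, 0, 0, 1, 0, 1, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]


def euclidean(u, v):
    """Square root of the sum of squared element-wise differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Number of positions where the two documents differ in count.
differing = sum(1 for a, b in zip(doc_1, doc_2) if a != b)

distance = euclidean(doc_1, doc_2)
print(differing, distance)
```

The two vectors differ by exactly 1 in ten positions, so the distance is the square root of 10, roughly 3.162, in line with the scikit-learn result.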


In [54]:
print("Loop started")
for i in range(len(vectorized_x)):
    for j in range(i, len(vectorized_x)):  # Starts at i, so the i == j case is skipped explicitly below
        if i == j:
            print(f'Skipping distance for documents {i+1} and {j+1} (same document)')
            continue  # Skip the pair formed by a document with itself
        print(f"Processing pair ({i+1}, {j+1})")
        jarak = euclidean_distances([vectorized_x[i]], [vectorized_x[j]])
        print(f'Distance between documents {i+1} and {j+1}: {jarak}')

Loop started
Skipping distance for documents 1 and 1 (same document)
Processing pair (1, 2)
Distance between documents 1 and 2: [[3.16227766]]
Processing pair (1, 3)
Distance between documents 1 and 3: [[3.74165739]]
Skipping distance for documents 2 and 2 (same document)
Processing pair (2, 3)
Distance between documents 2 and 3: [[3.46410162]]
Skipping distance for documents 3 and 3 (same document)
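Both loops above visit each unordered pair of documents exactly once. One possible alternative, sketched here under the assumption that Python 3.8+ is available (for `math.dist`), is to let `itertools.combinations` generate the pairs so that no index arithmetic or explicit skip is needed. The vectors are copied from the `vectorized_x` output; the names `vectors` and `distances` are illustrative.

```python
import math
from itertools import combinations

# The three rows of vectorized_x, copied from the output above.
vectors = [
    [1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1],
    [0, 0, 0, 1, 0, 1, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1],
]

# combinations(..., 2) yields each pair (i, j) with i < j exactly once,
# so a document is never paired with itself.
distances = {}
for (i, u), (j, v) in combinations(enumerate(vectors, start=1), 2):
    distances[(i, j)] = math.dist(u, v)
    print(f"Distance between documents {i} and {j}: {distances[(i, j)]}")
```

This keeps the pairing logic in one place, which can be easier to read once the corpus grows beyond a handful of documents.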


# Stop word filtering

Stop word filtering removes words that carry little information for text analysis, typically very common words such as "the", "and", "is", and "in". Its main goals are to simplify the text data and to improve the quality of downstream analysis, for example in text classification or language modeling.
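As a rough illustration of the idea, the sketch below filters a token list against a small stop list in plain Python. The `STOP_WORDS` set is a tiny hypothetical list chosen for this example (real stop lists contain hundreds of entries), and `remove_stop_words` is an illustrative helper, not a library function.

```python
# A tiny, hypothetical stop list; real lists are much longer.
STOP_WORDS = {"the", "is", "of", "has", "been"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop list, preserving order."""
    return [token for token in tokens if token not in stop_words]


# Naive whitespace tokenization, assumed here for simplicity.
tokens = "linux is one of the most prominent open source software".split()
filtered = remove_stop_words(tokens)
print(filtered)
```

Which words count as stop words depends on the list in use, so the result can differ between libraries and between tasks.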

In [55]:
dataset = corpus
dataset

['linux has been around since the mid-1990s.',
 'linux distributions include the linux kernel.',
 'linux is one of the most prominent open-source software.']

# Stop word filtering using CountVectorizer

In [57]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_0 = CountVectorizer(stop_words='english')
vectorized_x = vectorizer_0.fit_transform(dataset).todense()
vectorized_x

matrix([[1, 0, 0, 0, 1, 1, 0, 0, 0, 0],
        [0, 1, 1, 1, 2, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]])

In [58]:
vectorizer_0.get_feature_names_out()

array(['1990s', 'distributions', 'include', 'kernel', 'linux', 'mid',
       'open', 'prominent', 'software', 'source'], dtype=object)
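Removing stop words shrinks the vocabulary from 19 features to 10, which also changes the distances between documents. The sketch below recomputes the pairwise Euclidean distances on the filtered count vectors, copied from the matrix output above. It uses only the standard library and assumes Python 3.8+ for `math.dist`; the variable names are illustrative.

```python
import math
from itertools import combinations

# Rows of the stop-word-filtered matrix, copied from the output above.
filtered_vectors = [
    [1, 0, 0, 0, 1, 1, 0, 0, 0, 0],
    [0, 1, 1, 1, 2, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 1, 1, 1, 1],
]

# Distance for each unordered pair of documents (1-based document numbers).
filtered_distances = {
    (i, j): math.dist(u, v)
    for (i, u), (j, v) in combinations(enumerate(filtered_vectors, start=1), 2)
}

for pair, value in filtered_distances.items():
    print(pair, round(value, 4))
```

With the shared filler words gone, the squared distances drop to 6, 6, and 8, so all three distances are smaller than in the unfiltered case, and the remaining differences come from content words only.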