# Latent Semantic Indexing (LSI)

Consider the following documents:

In [1]:
d1 = "Shipment of gold damaged in a fire"
d2 = "Delivery of silver arrived in a silver truck"
d3 = "Shipment of gold arrived in a truck"

Raw _term frequency_ is used to weight each _term_ in the documents and the query, under the following rules:
- _stop words_ are not removed (the matrix below keeps rows for "a", "in" and "of")
- the _text_ is split into words and _lowercased_
- no _stemming_ is applied
- _terms_ are sorted alphabetically

__Problem:__ Use _Latent Semantic Indexing_ (LSI) to rank the documents for the query:

In [2]:
query = "gold silver truck"

__STEP 1:__ Weight each term, then build the term-document matrix __A__ and the query vector:

In [3]:
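import numpy as np

docs = [d1, d2, d3]
# Lowercase each text and split it on whitespace
tokenized_doc = [d.lower().split() for d in docs]
tokenized_query = query.lower().split()

# Vocabulary: every distinct term in the documents, sorted alphabetically
terms = set().union(*tokenized_doc)
sorted_terms = sorted(terms)

# One row of raw term counts per document
A = list()

for d in tokenized_doc:
    A.append([d.count(term) for term in sorted_terms])

# Query vector over the same vocabulary
Q = np.array([tokenized_query.count(term) for term in sorted_terms])

A = np.array(A)

# A is stored as documents x terms, so transpose it to print terms x documents
print("\nMatrix A")
print(A.transpose())
print("\n")
print("Matrix q")
print(Q.transpose())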


Matrix A
[[1 1 1]
 [0 1 1]
 [1 0 0]
 [0 1 0]
 [1 0 0]
 [1 0 1]
 [1 1 1]
 [1 1 1]
 [1 0 1]
 [0 2 0]
 [0 1 1]]


Matrix q
[0 0 0 0 0 1 0 0 0 1 1]
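
The rows of the printed matrix carry no labels, so it can help to see which term each row counts. The following is a minimal sketch, not part of the original notebook, that rebuilds the same counts with `collections.Counter` and prints each row next to its term. The function name `build_term_document_matrix` is illustrative, and the code is written for Python 3.

```python
from collections import Counter

import numpy as np


def build_term_document_matrix(docs):
    """Return (sorted_terms, matrix) with one row per term and one column per document."""
    tokenized = [d.lower().split() for d in docs]
    sorted_terms = sorted(set().union(*tokenized))
    counts = [Counter(tokens) for tokens in tokenized]
    # Counter returns 0 for terms that a document does not contain
    matrix = np.array([[c[term] for c in counts] for term in sorted_terms])
    return sorted_terms, matrix


docs = [
    "Shipment of gold damaged in a fire",
    "Delivery of silver arrived in a silver truck",
    "Shipment of gold arrived in a truck",
]
terms, A = build_term_document_matrix(docs)

# Print each term next to its row of counts (columns: d1, d2, d3)
for term, row in zip(terms, A):
    print("{:<9} {}".format(term, row))
```

Building the matrix directly in terms x documents orientation avoids the transpose used above; the counts themselves are unchanged.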


__STEP 2:__ Decompose matrix __A__ with the singular value decomposition (SVD) to obtain the matrices __U__, __S__ and __V__

In [4]:
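# Reduced SVD of the terms x documents matrix; S is returned as a 1-D array of singular values
U, S, Vt = np.linalg.svd(A.transpose(), full_matrices=False)

print("U")
print(U)
print("S")
print(S)
print("V transpose")
print(Vt)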

U
[[-0.42012157 -0.07479925 -0.04597244]
 [-0.29948676  0.20009226  0.40782766]
 [-0.12063481 -0.27489151 -0.4538001 ]
 [-0.157561    0.30464762 -0.2006467 ]
 [-0.12063481 -0.27489151 -0.4538001 ]
 [-0.26256057 -0.37944687  0.15467426]
 [-0.42012157 -0.07479925 -0.04597244]
 [-0.42012157 -0.07479925 -0.04597244]
 [-0.26256057 -0.37944687  0.15467426]
 [-0.315122    0.60929523 -0.40129339]
 [-0.29948676  0.20009226  0.40782766]]
S
[ 4.09887197  2.3615708   1.27366868]
V transpose
[[-0.49446664 -0.64582238 -0.58173551]
 [-0.64917576  0.71944692 -0.24691489]
 [-0.57799098 -0.25555741  0.77499473]]
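
As a sanity check on the decomposition, the three factors should multiply back to the original matrix. The sketch below is an addition to the notebook, written for Python 3, with the term-document matrix typed in from the output of Step 1. Note that singular vectors are only determined up to sign, so a different NumPy or LAPACK build may print `U` and `Vt` with some columns and rows negated; the product is unaffected.

```python
import numpy as np

# Term-document matrix from Step 1 (rows: terms, columns: d1, d2, d3)
A = np.array([
    [1, 1, 1], [0, 1, 1], [1, 0, 0], [0, 1, 0], [1, 0, 0], [1, 0, 1],
    [1, 1, 1], [1, 1, 1], [1, 0, 1], [0, 2, 0], [0, 1, 1],
])

U, S, Vt = np.linalg.svd(A, full_matrices=False)

# With full_matrices=False, U is 11x3, S holds 3 singular values, Vt is 3x3
reconstructed = U.dot(np.diag(S)).dot(Vt)
reconstruction_ok = np.allclose(reconstructed, A)

# The columns of U are orthonormal, so U^T U is the 3x3 identity
columns_orthonormal = np.allclose(U.T.dot(U), np.eye(3))

print(reconstruction_ok, columns_orthonormal)
```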


__STEP 3:__ Form a _rank-2_ approximation by keeping the first 2 columns of __U__ and __V__ (the first 2 rows of __V__ transpose) and the first 2 rows and columns of __S__

In [5]:
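Uk = U[:, :2]    # first 2 columns of U
Sk = S[:2]       # 2 largest singular values
Vtk = Vt[:2, :]  # first 2 rows of V transpose, i.e. first 2 columns of V

print("Matrix Uk")
print(Uk)
print("Matrix Sk")
print(Sk)
print("Matrix Vtk")
print(Vtk)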

Matrix Uk
[[-0.42012157 -0.07479925]
 [-0.29948676  0.20009226]
 [-0.12063481 -0.27489151]
 [-0.157561    0.30464762]
 [-0.12063481 -0.27489151]
 [-0.26256057 -0.37944687]
 [-0.42012157 -0.07479925]
 [-0.42012157 -0.07479925]
 [-0.26256057 -0.37944687]
 [-0.315122    0.60929523]
 [-0.29948676  0.20009226]]
Matrix Sk
[ 4.09887197  2.3615708 ]
Matrix Vtk
[[-0.49446664 -0.64582238 -0.58173551]
 [-0.64917576  0.71944692 -0.24691489]]
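
To see how much information the truncation discards, the reduced factors can be multiplied back into a rank-2 matrix and compared with the original. This is a sketch added for illustration, written for Python 3; the names `A_k` and `error` are not from the notebook. For a truncated SVD, the Frobenius norm of the difference equals the square root of the sum of the squared discarded singular values, which here is just the third singular value.

```python
import numpy as np

# Term-document matrix from Step 1 (rows: terms, columns: d1, d2, d3)
A = np.array([
    [1, 1, 1], [0, 1, 1], [1, 0, 0], [0, 1, 0], [1, 0, 0], [1, 0, 1],
    [1, 1, 1], [1, 1, 1], [1, 0, 1], [0, 2, 0], [0, 1, 1],
])

U, S, Vt = np.linalg.svd(A, full_matrices=False)
Uk, Sk, Vtk = U[:, :2], S[:2], Vt[:2, :]

# Rank-2 approximation of A
A_k = Uk.dot(np.diag(Sk)).dot(Vtk)

# np.linalg.norm on a 2-D array defaults to the Frobenius norm
error = np.linalg.norm(A - A_k)

print(np.linalg.matrix_rank(A_k))
print(error, S[2])
```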


__STEP 4:__ Find the new document vectors in the reduced space

In [6]:
# Column a of Vtk holds the 2-D coordinates of document a in the reduced space
DOCS_eigen_vector = [(Vtk[0][a], Vtk[1][a]) for a in range(Vtk.shape[1])]

for index, doc_vector in enumerate(DOCS_eigen_vector):
    print("d{}({},{})".format(index+1, doc_vector[0], doc_vector[1]))

d1(-0.494466642225,-0.649175761898)
d2(-0.64582237611,0.719446917487)
d3(-0.5817355054,-0.246914890364)


__STEP 5:__ Find the coordinates of the query vector in the reduced space

In [7]:
# Fold the query into the reduced space: q_k = q^T . Uk . Sk^-1
Sk_inverse = np.linalg.inv(np.diag(Sk))
new_Q_vector = Q.dot(Uk).dot(Sk_inverse)
print(new_Q_vector)

[-0.21400262  0.18205705]
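
The projection used for the query is the same mapping that places the documents in the reduced space: applying it to a column of __A__ returns the matching column of `Vtk`. The sketch below, an illustrative addition written for Python 3, checks this for all three documents and recomputes the query coordinates. The helper name `fold_in` is hypothetical. Because singular vectors are only determined up to sign, the query coordinates are compared by absolute value.

```python
import numpy as np

# Term-document matrix from Step 1 (rows: terms, columns: d1, d2, d3)
A = np.array([
    [1, 1, 1], [0, 1, 1], [1, 0, 0], [0, 1, 0], [1, 0, 0], [1, 0, 1],
    [1, 1, 1], [1, 1, 1], [1, 0, 1], [0, 2, 0], [0, 1, 1],
])
# Query "gold silver truck" over the same sorted vocabulary
Q = np.array([0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1])

U, S, Vt = np.linalg.svd(A, full_matrices=False)
Uk, Sk, Vtk = U[:, :2], S[:2], Vt[:2, :]


def fold_in(term_vector):
    """Project a term-count vector into the 2-D space: x^T . Uk . Sk^-1."""
    # Dividing by Sk is equivalent to multiplying by the inverse of diag(Sk)
    return term_vector.dot(Uk) / Sk


# Folding in each document column reproduces its coordinates in Vtk
docs_match = all(np.allclose(fold_in(A[:, j]), Vtk[:, j]) for j in range(A.shape[1]))

query_coords = fold_in(Q)
print(docs_match)
print(query_coords)
```

Dividing by `Sk` elementwise avoids building and inverting a diagonal matrix, which gives the same result for a diagonal __S__.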


__STEP 6:__ Rank the documents in descending order of similarity to the query

In [8]:
# sim comes from a local module that is not shown; the values below match cosine similarity
from manual_tests import sim

similarities = list()
for index, doc in enumerate(DOCS_eigen_vector):
    similarities.append((index, sim(new_Q_vector, doc)))

# Sort by similarity, highest first
rank_similarity = sorted(similarities, key=lambda item: -item[1])

print("Documents ranked by similarity")
for index, sim_value in rank_similarity:
    print("d{} {}, similarity: {}".format(index+1, docs[index], sim_value))


Documents ranked by similarity
d2 Delivery of silver arrived in a silver truck, similarity: 0.990987426748
d3 Shipment of gold arrived in a truck, similarity: 0.447959465828
d1 Shipment of gold damaged in a fire, similarity: -0.0539508436664
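
The `sim` helper is imported from a module that is not included here. Its printed values agree with the cosine of the angle between the query vector and each document vector, so the sketch below assumes that definition. It is written for Python 3, uses the coordinates printed in Steps 4 and 5, and the name `cosine_similarity` is illustrative rather than the notebook's own.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (assumed definition of sim)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Coordinates copied from the outputs of Steps 4 and 5
query_vector = (-0.21400262, 0.18205705)
doc_vectors = {
    "d1": (-0.494466642225, -0.649175761898),
    "d2": (-0.64582237611, 0.719446917487),
    "d3": (-0.5817355054, -0.246914890364),
}

scores = {name: cosine_similarity(query_vector, vec) for name, vec in doc_vectors.items()}

# Highest similarity first
ranking = sorted(scores, key=scores.get, reverse=True)
for name in ranking:
    print(name, scores[name])
```

Cosine similarity ignores vector length and depends only on direction, which is why d2 scores close to 1 even though its coordinates are larger in magnitude than the query's.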
