# MLlab_LSI_query

This file contains an implementation of an information retrieval (query) method based on Latent Semantic Indexing (LSI). LSI is a mature and widely used technique in natural language processing. It applies a rank-reduced Singular Value Decomposition (SVD) to the TF-IDF matrix (the TF-IDF matrix is introduced in the file TFIDF_company_chart). LSI assumes that words occurring in the same documents tend to be related, and the rank-reduced SVD merges words with similar meanings or strong relationships into shared dimensions. One advantage of an LSI-based query method is that it can find highly related content even when the words of the query do not appear in that content. The mathematical details follow.


The procedure is:

   1. First, we load 1000 cleaned documents and use TfidfVectorizer from the sklearn library to build their TF-IDF matrix.  

   2. Given the query string, we treat it as a new document and compute its TF-IDF vector (a single vector, since there is only one document) using the vocabulary and IDF weights of the TF-IDF matrix from step 1. We assume that input queries contain no repeated words. Therefore, for every word in the query, we look up its index in the TF-IDF matrix and set that entry to 1 * IDF(index). All words that do not occur in the query get the value 0. The resulting vector is then normalized to unit length.  

   3. Applying a rank-reduced SVD to the TF-IDF matrix, we get:
    <img src="https://latex.codecogs.com/svg.latex?A&space;\approx&space;A_{k}&space;=&space;U_{k}S_{k}V_{k}^T" title="A \approx A_{k} = U_{k}S_{k}V_{k}^T" />
    Here A is the original TF-IDF matrix, arranged as terms by documents (the transpose of the matrix returned by TfidfVectorizer), and Ak is its rank-k approximation. We obtain Sk, Uk and Vk by keeping only the k largest singular values in the singular value matrix S (giving Sk), and keeping only the corresponding first k columns of the left singular vector matrix U and of the right singular vector matrix V (equivalently, the first k rows of V^T).
    In the code, we set k = 100 and use the svds function from scipy.sparse.linalg to compute this decomposition efficiently. After this step, the documents of the TF-IDF matrix are represented in a low-dimensional space.  
    
   4. To compute the similarity between the query and the document vectors in the low-dimensional space, we must also map the query vector from step 2 into this space. We do that by evaluating this formula:
    <img src="https://latex.codecogs.com/svg.latex?q_{k}&space;=&space;S^{-1}_{k}U^T_{k}q" title="q_{k} = S^{-1}_{k}U^T_{k}q" />
    Here qk is the query's k-dimensional vector and q is its original vector. This works because each column of Vk^T (each row of Vk) describes exactly one document and is related to that document's original vector d in the same way, and because LSI treats the query as a new document.  
    
   5. Finally, we use cosine similarity to compare the query vector with the document vectors, and return the 10 most similar documents as the result. The cosine similarity formula is:
   <img src="https://latex.codecogs.com/svg.latex?sim(q,d)&space;=&space;\frac{q*d}{\begin{vmatrix}&space;q&space;\end{vmatrix}\begin{vmatrix}&space;d&space;\end{vmatrix}}" title="sim(q,d) = \frac{q*d}{\begin{vmatrix} q \end{vmatrix}\begin{vmatrix} d \end{vmatrix}}" />
   Cosine similarity only measures how similar the directions of two vectors are. This is useful because the magnitude (norm) of the query vector is usually smaller than that of a document vector, since the query contains no repeated words.
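
Steps 3 to 5 can be sketched on a small dense matrix. This is only an illustration under stated assumptions, not the notebook's actual code: the matrix values, the query and the helper names `fold_in` and `cosine` are made up, and `np.linalg.svd` stands in for the sparse `svds` call. It also checks the identity behind step 4: folding an existing document column into the low-dimensional space reproduces its column of Vk^T.

```python
import numpy as np

# Toy term-by-document matrix (5 terms x 4 documents); the values are made up.
A = np.array([
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [2.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 2.0],
    [1.0, 2.0, 0.0, 1.0],
])
k = 2

# Step 3: full SVD, then keep the k largest singular values and their vectors.
# np.linalg.svd returns the singular values in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

def fold_in(vec):
    """Map a term-space vector to the k-dimensional space: S_k^{-1} U_k^T vec."""
    return (U_k.T @ vec) / s_k

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Step 4 sanity check: folding in document 2 gives column 2 of V_k^T.
doc_2_folded = fold_in(A[:, 2])

# Steps 4 and 5: fold in a query that uses terms 0 and 2, then rank documents.
q = np.array([1.0, 0.0, 1.0, 0.0, 0.0])
q_k = fold_in(q)
scores = np.array([cosine(q_k, Vt_k[:, j]) for j in range(A.shape[1])])
ranking = np.argsort(-scores)  # document indices, most similar first
```

Dividing by `s_k` element-wise is equivalent to multiplying by the inverse of the diagonal matrix Sk, so no explicit matrix inversion is needed.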
   
In the code below, I use "Sportbekleidung schuh" as the query. The top 10 similar (or related) documents returned by this program are 8 from PUMA, 1 from Adidas and 1 from Zalando, all companies that sell sportswear and shoes. This result suggests that the program works. The key point is that I do not actually know whether these 2 words occur in the top 10 documents, yet LSI still returns these highly related results. Traditional keyword search cannot do this, because it only matches documents that contain the query words.
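
The claim that LSI can match documents that do not contain the query words can be illustrated with a toy example. This is a sketch under assumptions, not a result from the real data: the five terms, the four documents and the raw counts (instead of TF-IDF weights) are invented for illustration. Document 1 never mentions "schuh", but it shares "sneaker" and "sport" with document 0, which does.

```python
import numpy as np

# Hypothetical term-by-document count matrix (rows: terms, columns: documents).
terms = ["schuh", "sneaker", "sport", "bank", "kredit"]
A = np.array([
    [1.0, 0.0, 0.0, 0.0],  # schuh
    [1.0, 1.0, 0.0, 0.0],  # sneaker
    [1.0, 1.0, 0.0, 0.0],  # sport
    [0.0, 0.0, 1.0, 1.0],  # bank
    [0.0, 0.0, 1.0, 1.0],  # kredit
])

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# One-word query "schuh" as a term-space vector.
q = np.zeros(len(terms))
q[terms.index("schuh")] = 1.0

# Keyword-style matching: cosine similarity in the original term space.
raw = [cosine(q, A[:, j]) for j in range(A.shape[1])]

# LSI matching: keep k = 2 dimensions, fold in the query, compare in that space.
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
q_k = (U_k.T @ q) / s_k
lsi = [cosine(q_k, Vt_k[:, j]) for j in range(A.shape[1])]

print("term space:  ", np.round(raw, 3))
print("latent space:", np.round(lsi, 3))
```

In the term space, document 1 scores zero because it shares no word with the query. In the two-dimensional latent space it points in the same direction as the query, while the two finance documents stay orthogonal to it.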



In [None]:
import spacy
import json
from sklearn.feature_extraction.text import TfidfVectorizer
from spacy.lang.de.stop_words import STOP_WORDS
from sklearn.cluster import KMeans
import time
import numpy as np
from scipy.sparse.linalg import svds, eigs

In [181]:
%run src/file_utils.py
%run src/configuration.py
%run 'load_and_prepro_document.ipynb'

## LSI
LSI essentially applies an SVD to the TF-IDF matrix in order to obtain an approximation of it with a small number of dimensions.
Here I use LSI to build an information retrieval application.

In [182]:
# use the os library to collect the file names in this folder and keep the first 1000
import os
documents_list = list()
for root, dirs, files in os.walk("./LabShare/data/all/json", topdown=False):
    for name in files:
        documents_list.append(name)
documents_list = documents_list[:1000]

In [183]:
start_time = time.time()
# here I override the preProcess() in fit_transform(). Because the input data is already preprocessed.
def preProcess(s):
    return s
my_doc, my_doc_name = get_clean_data(documents_list)
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform( my_doc )
print (time.time() - start_time)

7.5903236865997314


In [184]:
# now compute the input query's vector.
query = 'Sportbekleidung schuh '   #query string

# step 1, apply the same preprocessing to the query
nlp = spacy.load("de")
sentence = nlp(query, disable=['parser', 'ner'])
filtered_words = [word for word in sentence if word.lower_ not in STOP_WORDS]
filtered_words_withoutdigits = [word for word in filtered_words if not word.is_digit]
filtered_words_withoutpunc = [word for word in filtered_words_withoutdigits if word.pos_ != 'PUNCT']
filtered_lemmas = [word.lemma_ for word in filtered_words_withoutpunc]

vocabularly = set()
for word in filtered_lemmas:
    vocabularly.add(word.replace('\n', '').strip().lower())

new_vocab = set()
for u in vocabularly:
    if u != '':
        new_vocab.add(u)

# step 2, generate query's tf-idf vector
query_vector_ori = np.zeros(tfidf_matrix.shape[1])  # initialize the query vector
idf = vectorizer.idf_
feature_name = vectorizer.get_feature_names()

# find my words in this feature_name list, and its corresponding index
print("search query is: ")
print(new_vocab)
for words in new_vocab:
    idx = feature_name.index(words)
    query_vector_ori[idx] = idf[idx]
# normalize the query vector to unit length
query_vector_ori = query_vector_ori/np.linalg.norm(query_vector_ori)

# step 3, map the original vector to the low-dimensional space
k = 100
u, s, vt = svds(tfidf_matrix.T, k=k)  # transpose tfidf_matrix to get a term x document matrix
# k is the number of dimensions kept; it can range from 1 to (number of documents - 1)
# q_hat = inv(S) * U.T * q   (the query q is treated as a new document)
s_dig = np.diag(s)
query_vector_low_dim = ((np.linalg.inv(s_dig)).dot(u.T)).dot(query_vector_ori)
# get query in low dim

# step4, compute the similarity
def calculate_simility(q1,q2):
    sim = q1.dot(q2)/(np.linalg.norm(q1)*np.linalg.norm(q2))
    return sim
sim = np.zeros(vt.shape[1])
for i in range(0,vt.shape[1]):
    sim[i] = calculate_simility(query_vector_low_dim,vt[:,i])

# step 5, take the top 10 similar documents
top_idx = np.argsort(-sim)[0:10]  # sort -sim to get descending order, then keep the top 10 indices
print('------------------------------------')
print('related document: \t related score')
for i in top_idx:
    print(my_doc_name[i]+':\t'+ str(sim[i]))

# try to find some way to connect document and this index

search query is: 
{'sportbekleidung', 'schuh'}
------------------------------------
related document: 	 related score
PUMA-QuarterlyReport-2012-Q3.json:	0.9705766715598944
PUMA-QuarterlyReport-2012-Q2.json:	0.9584773368670201
PUMA-QuarterlyReport-2015-Q1.json:	0.9543901309787146
PUMA-QuarterlyReport-2014-Q3.json:	0.9507635544600765
PUMA-QuarterlyReport-2010-Q2.json:	0.9435183542968074
PUMA-QuarterlyReport-2011-Q3.json:	0.9304272544834216
PUMA-QuarterlyReport-2010-Q1.json:	0.9128649922630564
PUMA-AnnualReport-2013.json:	0.8812087556023984
Adidas-AnnualReport-2016.json:	0.3034457396028766
Zalando-AnnualReport-2015.json:	0.21861967178319974
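
The lookup `feature_name.index(words)` in the cell above raises a `ValueError` when a query word is not in the fitted vocabulary. Below is a minimal sketch of a guarded variant. It assumes a term-to-index dictionary, such as the `vocabulary_` attribute of a fitted TfidfVectorizer, and an IDF array such as `idf_`. The function name `build_query_vector` and the toy vocabulary and weights are hypothetical.

```python
import numpy as np

def build_query_vector(query_terms, vocabulary, idf):
    """Build a unit-length TF-IDF query vector, skipping unknown terms.

    vocabulary: dict mapping term -> column index
    idf:        1-D array of IDF weights, indexed by column
    Returns the vector and a sorted list of terms missing from the vocabulary.
    """
    vec = np.zeros(len(idf))
    unknown = []
    for term in set(query_terms):      # each query term counts once (tf = 1)
        idx = vocabulary.get(term)
        if idx is None:
            unknown.append(term)       # out-of-vocabulary: report, do not crash
            continue
        vec[idx] = idf[idx]
    norm = np.linalg.norm(vec)
    if norm > 0:                       # avoid dividing by zero for empty queries
        vec = vec / norm
    return vec, sorted(unknown)

# Hypothetical toy vocabulary and IDF weights, for illustration only.
vocabulary = {"bank": 0, "schuh": 1, "sportbekleidung": 2}
idf = np.array([1.2, 2.5, 3.0])

vec, unknown = build_query_vector(
    ["schuh", "sportbekleidung", "laufschuh"], vocabulary, idf
)
print(vec, unknown)
```

A dictionary lookup also avoids scanning the whole feature list for every query word, which matters little for a two-word query but scales better for long ones.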
