# Text 2: Latent semantic indexing
**Internet Analytics - Lab 4**

---

**Group:** *O*

**Names:**

* *Argelaguet Franquelo, Pau*
* *du Bois de Dunilac, Vivien*

---

#### Instructions

*This is a template for part 2 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long code cells in the notebook, you can also embed long Python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [1]:
import pickle
import operator
import numpy as np
from scipy.sparse.linalg import svds
from nltk.stem import SnowballStemmer
from functools import reduce

from utils import load_pkl

In [2]:
dat = load_pkl("data/preprocess.pckl")
mat = load_pkl("data/mat.pckl")
terms = load_pkl("data/terms.pckl")
documents = load_pkl("data/documents.pckl")

In [3]:
M, N = mat.shape

In [4]:
stemmer = SnowballStemmer("english")

## Exercise 4.4: Latent semantic indexing

In [5]:
# Compute a truncated SVD of the term-document matrix, keeping the 300 largest singular values
u, s, vt = svds(mat, k=300)

In [6]:
print("U:\n{}\n".format(u))
print("S:\n{}\n".format(s))
print("Vt:\n{}\n".format(vt))

U:
[[ 4.75531425e-04 -1.42264743e-03  3.48438305e-04 ...  1.59905513e-04
  -3.75918352e-04  3.47056934e-04]
 [ 6.63699584e-03  7.40679389e-05 -9.91013840e-03 ...  3.67202019e-04
  -6.91237574e-04  1.23154955e-03]
 [ 4.43642287e-04 -1.08911005e-03 -1.09324994e-03 ... -3.45563402e-04
   2.20083864e-04  4.84372219e-04]
 ...
 [-4.86839273e-03 -1.39770492e-03  2.67787422e-03 ... -7.86812776e-04
   8.20777860e-04  1.01081548e-03]
 [-4.86839273e-03 -1.39770492e-03  2.67787422e-03 ... -7.86812776e-04
   8.20777860e-04  1.01081548e-03]
 [ 1.56029753e-02 -1.52466702e-03 -2.91512182e-03 ...  2.30137649e-03
  -4.33040952e-03  5.62151557e-03]]

S:
[10.21018259 10.22389056 10.22764907 10.24104833 10.26100639 10.27784965
 10.30843079 10.31586493 10.32829224 10.34992689 10.35682613 10.3707739
 10.39050684 10.41472023 10.42579165 10.44747449 10.45200045 10.46057364
 10.49742753 10.51434788 10.54982812 10.56413699 10.56554248 10.57418803
 10.58232244 10.60515192 10.61121034 10.61761854 10.6535142  10.67

As with any singular value decomposition, decomposing the term-document matrix X yields three matrices:

* **U** has one row per term and one column per concept: each entry is the weight of that term in the given concept.
* **V** plays the same role as U, but relates concepts to documents instead of terms. Note that svds returns its transpose, Vt, so each column of Vt corresponds to a document.
* **S** is a diagonal matrix of singular values, each of which describes the importance of the corresponding concept. Note that svds returns them as a vector, s.
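
As a small illustration of the three factors, the sketch below decomposes a made-up 4×3 term-document matrix. It uses np.linalg.svd instead of svds purely to keep the example short (an assumption of this sketch, not what the notebook does). Unlike the s printed above, np.linalg.svd lists singular values in descending order.

```python
import numpy as np

# Hypothetical toy term-document matrix: 4 terms (rows) x 3 documents (columns)
X = np.array([
    [2.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
    [0.0, 3.0, 1.0],
    [0.0, 1.0, 0.0],
])

# Thin SVD: X = U @ diag(s) @ Vt, with s sorted in descending order
U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(U.shape, s.shape, Vt.shape)

# Keep only the k most important concepts (rank-k approximation of X)
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.round(X_k, 2))
```

Dropping the smallest singular value leaves a residual X - X_k whose Frobenius norm equals that singular value, which is one way to read "importance of a concept".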

In [7]:
# Squared singular values, largest first (s is in ascending order, so reverse it)
eigvals = np.power(s, 2)[::-1][:20]
print("Top-20 eigenvalues of X:")
eigvals

Top-20 eigenvalues of X:


array([1218.12814501,  686.71420463,  623.49994375,  614.28696605,
        590.00334217,  543.37560286,  529.60266146,  518.82764824,
        490.19076036,  457.1424072 ,  445.83407473,  434.75528609,
        428.84678037,  422.35330548,  413.77961282,  405.24692545,
        402.82697685,  397.11683637,  388.74015526,  383.75983474])
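
Since X is rectangular, the values above are best read as eigenvalues of XᵀX (equivalently, the nonzero eigenvalues of XXᵀ), which are the squared singular values of X. The following sketch checks this relation on a small random matrix. The matrix and its size are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 4))  # hypothetical small matrix standing in for X

# Singular values of X, in descending order
s = np.linalg.svd(X, compute_uv=False)

# Eigenvalues of the symmetric matrix X^T X; eigvalsh returns them ascending
eig = np.linalg.eigvalsh(X.T @ X)[::-1]

# The two should agree up to floating-point error
print(np.allclose(s ** 2, eig))
```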

## Exercise 4.5: Topic extraction

In [8]:
# Take the 10 concepts with the largest singular values from u and vt.
# Here s is in ascending order, so reverse before slicing.
top_u = u.T[::-1][:10]
top_v = vt[::-1][:10]

# For each concept, list the 15 terms and 15 documents with the highest weights,
# that is, the ones that contribute most to that concept
for i, (t, d) in enumerate(zip(top_u, top_v)):
    ts = [terms[x] for x in t.argsort()[::-1][:15]]
    ds = [dat[documents[x]]['name'] for x in d.argsort()[::-1][:15]]
    print("Dimension", i + 1, "\n=====")
    print("Terms:\n", ts)
    print("Documents:\n", ds)
    print("\n")

Dimension 1 
=====
Terms:
 ['data', 'problem', 'evalu', 'report', 'skill', 'comput', 'electron', 'cell', 'studi', 'scientif', 'research', 'engin', 'mechan', 'plan', 'applic']
Documents:
 ['Advanced materials for photovoltaics and lighting', 'Lab immersion III', 'MINTT: Management of Innovation and technology transfer (EDOC)', 'History of globalization I', 'Lab immersion I', 'Project in bioengineering and biosciences', 'Philosophy of life sciences I', 'Cellular biology and biochemistry for engineers', 'Bioprocesses and downstream processing', 'Principles of finance', 'Experimental biochemistry and biophysics', "CCMX Winter School - Additive Manufacturing of Metals and the Material Science Behind It'", 'Quantitative methods in finance', 'Particle-based methods', 'Nanobiotechnology and biophysics']


Dimension 2 
=====
Terms:
 ['data', 'financi', 'financ', 'price', 'skill', 'risk', 'plan', 'evalu', 'optim', 'valuat', 'experiment', 'problem', 'corpor', 'market', 'scientif']
Documents:
 ['P

Topic labels we assign to the ten dimensions above, in order:

0. Life sciences

1. Finance

2. Mathematical finance

3. Algorithms

4. Biology

5. Imaging

6. Laboratory

7. Chemistry

8. Energy

9. Power
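
The extraction step of Exercise 4.5 can be sketched on a toy matrix with two obvious topics. The vocabulary and counts are invented. One deliberate difference from the notebook: this sketch ranks terms by absolute weight, because the sign of a singular vector is arbitrary, whereas the cell above sorts the signed values.

```python
import numpy as np

terms = ["stock", "market", "cell", "protein"]  # hypothetical vocabulary
X = np.array([
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],
    [0.0, 0.0, 2.0, 1.0],
])

# np.linalg.svd returns concepts ordered from most to least important
U, s, Vt = np.linalg.svd(X, full_matrices=False)

def top_terms(concept, k=2):
    """Return the k terms with the largest absolute weight in a concept vector."""
    order = np.argsort(np.abs(concept))[::-1][:k]
    return [terms[i] for i in order]

# Top terms of the two most important concepts (columns of U)
topics = [top_terms(U[:, j]) for j in range(2)]
for j, t in enumerate(topics):
    print("Dimension", j + 1, sorted(t))
```

With signed sorting, a concept whose vector happens to come out negated would surface its least relevant terms, so checking the sign (or using absolute values as here) is a reasonable sanity check when labelling topics.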

## Exercise 4.6: Document similarity search in concept-space

In [9]:
# Turning s (vector) into S (diagonal matrix)
S = np.diag(s)

In [10]:
# Compute the similarity between a query vector and a document vector using the provided formula
def compute_sim(ut, vd):
    return np.dot(ut, S.dot(vd)) / (np.linalg.norm(ut) * np.linalg.norm(S.dot(vd)))


# Get the documents most similar to a text
def search(t):
    # Split the input text into words and stem them
    search_terms = list(map(stemmer.stem, t.split()))
    # Add bigrams of consecutive stemmed words
    bigrams = [a + " " + b for a, b in zip(search_terms, search_terms[1:])]
    # Keep only terms in the vocabulary; terms.index raises ValueError otherwise
    search_terms = [x for x in search_terms + bigrams if x in terms]
    if not search_terms:
        return []

    # Build the query vector q as the sum of the rows of U for its terms
    q = sum(u[terms.index(x)] for x in search_terms)

    # Compute the similarity of every document (column of Vt) with the query
    sims = [compute_sim(q, d) for d in vt.T]
    # Keep positive similarities only, keyed by course name
    sims = {dat[documents[i]]['name']: x for i, x in enumerate(sims) if x > 0}
    sims = sorted(sims.items(), key=operator.itemgetter(1), reverse=True)
    return sims

In [11]:
print("Top 5 courses for markov chains:")
search('markov chains')[:5]

Top 5 courses for markov chains:


[('Applied stochastic processes', 0.8493125140393682),
 ('Applied probability & stochastic processes', 0.7683334599633568),
 ('Markov chains and algorithmic applications', 0.7062631524329015),
 ('Supply chain management', 0.3643857150073063),
 ('Mathematical models in supply chain management', 0.33746631152875783)]

In [12]:
print("Top 5 courses for facebook:")
search('facebook')[:5]

Top 5 courses for facebook:


[('Computational Social Media', 0.9616706165470982),
 ('Social media', 0.8789927989448072),
 ('Media security', 0.28832855429967114),
 ('Networks out of control', 0.21226715753443506),
 ('Advanced principles and applications of systems biology',
  0.19201882496768302)]

For markov chains, the results are similar to those of the previous part, but with different scores. For facebook, however, we now get more than one result. In both cases, this is because the search no longer matches explicit words but the concepts they represent.
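
The effect described above can be reproduced on a toy example: a query on a single term also retrieves a document that never contains it, because both load on the same concept. The vocabulary, titles and counts below are invented, and np.linalg.svd with a manual cut at k = 2 stands in for svds.

```python
import numpy as np

terms = ["facebook", "social", "media", "protein"]            # hypothetical vocabulary
docs = ["Social media A", "Social media B", "Biochemistry"]   # hypothetical titles
X = np.array([
    [1.0, 0.0, 0.0],   # "facebook" appears only in the first document
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 2.0],
])

# Keep the k most important concepts
k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_k, S_k, V_k = U[:, :k], np.diag(s[:k]), Vt[:k, :].T

def concept_sim(q, v_d):
    """Cosine between a query vector q and the scaled document vector S v_d."""
    sv = S_k @ v_d
    return q @ sv / (np.linalg.norm(q) * np.linalg.norm(sv))

# The query vector is the row of U for the query term
q = U_k[terms.index("facebook")]
sims = [concept_sim(q, v_d) for v_d in V_k]
for name, sim in zip(docs, sims):
    print(name, round(float(sim), 3))
```

Both social media documents score close to 1 even though only the first contains the query term, while the unrelated document scores close to 0.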

## Exercise 4.7: Document-document similarity

To calculate document-document similarity, we take the existing representation of documents as vectors (the columns of Vt, scaled by S) and compute their cosine similarity. We then sort the similarities to find the documents most similar to the Internet Analytics (IX) course.

In [13]:
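# Cosine similarity between two documents in concept space
def compute_doc_sim(d1, d2):
    # Scale each document vector by the singular values
    d1S = S.dot(d1)
    d2S = S.dot(d2)
    return d1S.dot(d2S) / (np.linalg.norm(d1S) * np.linalg.norm(d2S))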

In [14]:
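# Index of the Internet Analytics course (COM-308)
doc = documents.index('COM-308')

# Each row of v is the concept-space vector of one document
v = vt.T

# Similarity of the chosen course with every document, keeping positive values only
sims = [compute_doc_sim(v[doc], d) for d in v]
sims = {dat[documents[i]]['name']: x for i, x in enumerate(sims) if x > 0}  
sims = sorted(sims.items(), key=operator.itemgetter(1), reverse=True)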

In [15]:
print("Top classes similar to Internet Analytics:")
sims[:6]

Top classes similar to Internet Analytics:


[('Internet analytics', 1.0),
 ('Distributed information systems', 0.41504065554504316),
 ('Applied data analysis', 0.3982774061444288),
 ('A Network Tour of Data Science', 0.39332022400583666),
 ('Digital education & learning analytics', 0.3259200480581976),
 ('Financial big data', 0.32077327378292825)]
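
As a possible alternative to the Python loop in Exercise 4.7, all pairwise similarities can be obtained with one matrix product after normalising the scaled document vectors. This is only a sketch: the singular values and the random matrix below are hypothetical stand-ins for s and vt.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n_docs = 3, 5
s = np.array([3.0, 2.0, 1.0])           # hypothetical singular values
vt = rng.standard_normal((k, n_docs))   # hypothetical stand-in for Vt

# Rows of D are the scaled document vectors S v_d
D = vt.T * s
D_unit = D / np.linalg.norm(D, axis=1, keepdims=True)

# Cosine similarity between every pair of documents
sim_matrix = D_unit @ D_unit.T

# Documents ranked by similarity to document 0 (itself first)
doc = 0
ranking = np.argsort(sim_matrix[doc])[::-1]
print(ranking)
```

Each entry sim_matrix[i, j] matches what compute_doc_sim would return for documents i and j, so a whole row can be sorted at once.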