# Text 2: Latent semantic indexing
**Internet Analytics - Lab 4**

---

**Group:** *O*

**Names:**

* *Argelaguet Franquelo, Pau*
* *du Bois de Dunilac, Vivien*

---

#### Instructions

*This is a template for part 2 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of codes in the notebook, you can also embed long python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [1]:
import pickle
import operator
import numpy as np
from scipy.sparse.linalg import svds
from nltk.stem import SnowballStemmer
from functools import reduce

from utils import load_pkl

In [2]:
dat = load_pkl("data/preprocess.pckl")
mat = load_pkl("data/mat.pckl")
terms = load_pkl("data/terms.pckl")
documents = load_pkl("data/documents.pckl")

In [3]:
M, N = mat.shape

In [4]:
stemmer = SnowballStemmer("english")

## Exercise 4.4: Latent semantic indexing

In [5]:
# Compute a truncated SVD of the term-document matrix, keeping the 300 largest singular values
u, s, vt = svds(mat, k=300)

In [6]:
print("U:\n{}\n".format(u))
print("S:\n{}\n".format(s))
print("Vt:\n{}\n".format(vt))

U:
[[-9.68887812e-04  8.98294797e-04  2.00651142e-03 ... -2.11339226e-04
  -2.55484042e-04  3.82873645e-04]
 [ 4.39596334e-04 -8.23695943e-03 -2.64177604e-03 ... -1.29299593e-03
  -1.46493875e-03  1.41270066e-03]
 [ 8.08574020e-04 -2.08109150e-03  5.79369170e-05 ...  5.52254194e-04
   2.55659605e-04  5.37726825e-04]
 ...
 [-1.24180018e-02  6.16235786e-03  2.26840193e-02 ... -1.75524500e-03
  -7.41692901e-03  5.03089230e-03]
 [ 6.24752249e-04  2.15742698e-03 -2.67469022e-04 ...  1.92618739e-03
   1.07799377e-03  1.12821616e-03]
 [-1.53190533e-03  1.08578184e-03  7.50685750e-03 ... -4.77854170e-03
  -4.83765771e-03  5.71071883e-03]]

S:
[ 8.26214376  8.29163895  8.31116009  8.31141847  8.32749439  8.35467656
  8.36125285  8.38857632  8.4073862   8.42911247  8.43873641  8.45066588
  8.45416843  8.47028827  8.49684213  8.52436349  8.5279323   8.53262854
  8.5600703   8.5887046   8.59097322  8.59880867  8.62884505  8.64518168
  8.65053714  8.6575309   8.68010666  8.69884119  8.70434363  8.7

As with any singular value decomposition, X is factored into three matrices:

* **U** has one row per term and one column per concept; each entry is the weight of that term in that concept.
* **V** plays the same role for documents: it relates concepts to documents. Note that we work with its transpose, Vt, in which each row is a concept and each column is a document.
* **S** is the diagonal matrix of singular values, each of which describes the importance of the corresponding concept. Only its diagonal is returned, as the vector s.
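
To make the shapes of the three factors concrete, here is a minimal sketch on a small random matrix. The sizes (50 terms, 30 documents, 5 concepts) and the random data are assumptions chosen for illustration only; they are not the lab data. The concepts are sorted explicitly by singular value instead of relying on the order in which they are returned.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# Hypothetical toy term-document matrix: 50 terms, 30 documents, about 20% non-zero entries
rng = np.random.default_rng(0)
dense = rng.random((50, 30)) * (rng.random((50, 30)) < 0.2)
X = csr_matrix(dense)

k = 5
u, s, vt = svds(X, k=k)

# Sort the concepts by decreasing singular value
order = np.argsort(s)[::-1]
u, s, vt = u[:, order], s[order], vt[order, :]

print(u.shape, s.shape, vt.shape)  # (50, 5) (5,) (5, 30)

# Rank-k approximation of X built from the three factors
X_k = u @ np.diag(s) @ vt
rel_err = np.linalg.norm(dense - X_k) / np.linalg.norm(dense)
print("Relative reconstruction error:", rel_err)
```

The reconstruction error shrinks as k grows, which is why the singular values can be read as the importance of each concept.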

In [7]:
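# s is sorted in ascending order, so reverse it before taking the top 20.
# The squared singular values of X are the eigenvalues of X^T X
# (equivalently, the non-zero eigenvalues of X X^T).
eigvals = np.power(s, 2)[::-1][:20]
print("Top-20 eigenvalues of X^T X:")
eigvals

Top-20 eigenvalues of X^T X: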


array([1119.07433225,  561.80110877,  521.15475805,  463.04078001,
        455.37147556,  443.63345395,  410.04369745,  375.29091879,
        369.017892  ,  350.22683623,  346.62096847,  340.8283083 ,
        334.4391275 ,  328.28873629,  323.09012486,  316.66547214,
        305.6653677 ,  303.90386152,  297.79454853,  294.16578248])
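
The relation between singular values and eigenvalues used in this cell can be checked numerically. The sketch below uses a small random matrix (an assumption, not the lab data) and the dense routines `np.linalg.svd` and `np.linalg.eigvalsh` instead of `svds`, since the matrix is tiny.

```python
import numpy as np

# Hypothetical toy matrix standing in for X
rng = np.random.default_rng(1)
X = rng.random((8, 5))

# Singular values of X, returned in descending order
sing = np.linalg.svd(X, compute_uv=False)

# Eigenvalues of the symmetric matrix X^T X; eigvalsh returns them in ascending order
eig = np.linalg.eigvalsh(X.T @ X)[::-1]

# The squared singular values match the eigenvalues of X^T X
print(np.allclose(sing ** 2, eig))
```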

## Exercise 4.5: Topic extraction

In [10]:
# Take the top 10 concepts (those with the highest singular values) from u and vt.
# s is in ascending order, so the concepts are reversed before slicing.
top_u = u.T[::-1][:10]
top_v = vt[::-1][:10]

# For each concept, list the 15 terms and the 15 documents with the highest weights,
# that is, the ones that contribute most to it
for i, (t, d) in enumerate(zip(top_u, top_v)):
    ts = [terms[x] for x in t.argsort()[::-1][:15]]
    ds = [dat[documents[x]]['name'] for x in d.argsort()[::-1][:15]]
    print("Dimension", i + 1, "\n=====")
    print("Terms:\n", ts)
    print("Documents:\n", ds)
    print("\n")

Dimension 1 
=====
Terms:
 ['data', 'problem', 'evalu', 'report', 'comput', 'electron', 'skill', 'mechan', 'research', 'engin', 'scientif', 'cell', 'discuss', 'plan', 'applic']
Documents:
 ['Lab immersion III', 'Advanced materials for photovoltaics and lighting', 'Cellular biology and biochemistry for engineers', 'History of globalization I', 'Project in bioengineering and biosciences', 'Bioprocesses and downstream processing', 'Experimental biochemistry and biophysics', 'MINTT: Management of Innovation and technology transfer (EDOC)', 'Lab immersion I', 'Philosophy of life sciences I', 'Principles of finance', 'Nanobiotechnology and biophysics', 'Numerical methods in heat transfer', 'Quantitative methods in finance', 'Particle-based methods']


Dimension 2 
=====
Terms:
 ['financi', 'financ', 'price', 'risk', 'valuat', 'data', 'market', 'corpor', 'stochast', 'option', 'capit', 'firm', 'deriv', 'manag', 'optim']
Documents:
 ['Principles of finance', 'Introduction to finance (IF master 

From the top terms and documents of each dimension, we label the ten topics as follows:

1. Life sciences
2. Finance
3. Mathematical finance
4. Algorithms
5. Biology
6. Imaging
7. Laboratory
8. Chemistry
9. Energy
10. Power
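
The extraction step can be illustrated on a toy matrix whose topics are known in advance. Everything below is hypothetical: the six terms, the four documents and the counts are made up so that the first two documents are about finance and the last two about biology. One deliberate difference from the cell above: this sketch ranks terms by absolute weight, because the sign of a singular vector is arbitrary.

```python
import numpy as np
from scipy.sparse.linalg import svds

# Hypothetical vocabulary and term-document counts (rows: terms, columns: documents)
toy_terms = ["price", "market", "risk", "cell", "protein", "gene"]
X = np.array([
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [2, 2, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 2, 1],
    [0, 0, 1, 1],
], dtype=float)

u, s, vt = svds(X, k=2)

# Sort the concepts by decreasing singular value
order = np.argsort(s)[::-1]
u, s = u[:, order], s[order]


def top_terms(concept, n=3):
    """Return the n terms with the largest absolute weight in a concept."""
    weights = np.abs(u[:, concept])
    return [toy_terms[i] for i in np.argsort(weights)[::-1][:n]]


for c in range(2):
    print("Dimension", c + 1, top_terms(c))
```

The first dimension collects the finance terms and the second the biology terms, which is the same kind of reading we applied to the course data.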

## Exercise 4.6: Document similarity search in concept-space

In [15]:
# Turning s (vector) into S (diagonal matrix)
S = np.diag(s)

In [17]:
# Compute the similarity between a query vector and a document vector using the provided formula:
# the cosine of the angle between ut and S.dot(vd)
def compute_sim(ut, vd):
    return np.dot(ut, S.dot(vd)) / (np.linalg.norm(ut) * np.linalg.norm(S.dot(vd)))


# Get the documents most similar to a text query
def search(t):
    # Split the input text into words and stem them, keeping only words in the vocabulary
    # (terms.index would raise a ValueError on an unknown word)
    search_terms = [w for w in map(stemmer.stem, t.split()) if w in terms]
    if not search_terms:
        return []

    # Build the query vector q by summing the rows of U of all query terms
    q = sum(u[terms.index(x)] for x in search_terms)

    # Compute the similarity of every document (column of Vt) with the query
    sims = [compute_sim(q, d) for d in vt.T]
    # Keep positive similarities only, keyed by course name, sorted in decreasing order
    sims = {dat[documents[i]]['name']: x for i, x in enumerate(sims) if x > 0}
    sims = sorted(sims.items(), key=operator.itemgetter(1), reverse=True)
    return sims

In [18]:
print("Top 5 courses for markov chains:")
search('markov chains')[:5]

Top 5 courses for markov chains:


[('Applied stochastic processes', 0.8609327376522445),
 ('Applied probability & stochastic processes', 0.7771197541178925),
 ('Markov chains and algorithmic applications', 0.6557304950301164),
 ('Supply chain management', 0.5425899859425819),
 ('Statistical Sequence Processing', 0.4943533254452211)]

In [19]:
print("Top 5 courses for facebook:")
search('facebook')[:5]

Top 5 courses for facebook:


[('Computational Social Media', 0.9351569084095054),
 ('Social media', 0.7693572030048721),
 ('Media security', 0.35814100943358684),
 ('Advanced principles and applications of systems biology',
  0.2215975645614137),
 ('Image communication', 0.19981436592184496)]

For markov chains, the results are similar to those of the previous part, but with different scores. For facebook, however, we now get more than one result. In both cases, this is because documents are no longer matched on the explicit query words but on the concepts those words represent.
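
The per-document loop used for the query search can also be written as a single matrix product. The sketch below is one possible vectorized form, checked against the loop on random stand-in arrays; the sizes, the random data and the name `query_sims` are assumptions made for illustration.

```python
import numpy as np

# Hypothetical stand-ins: 4 concepts, 6 documents
rng = np.random.default_rng(2)
k, n_docs = 4, 6
s = np.sort(rng.random(k) + 1.0)        # stand-in singular values
vt = rng.standard_normal((k, n_docs))   # stand-in for Vt
q = rng.standard_normal(k)              # stand-in query vector in concept space


def query_sims(q, s, vt):
    """Cosine similarity between q and S.dot(v_d) for every document d."""
    docs = vt.T * s  # row d equals S.dot(v_d) because S is diagonal
    return docs @ q / (np.linalg.norm(q) * np.linalg.norm(docs, axis=1))


# Reference: the same formula evaluated one document at a time
S = np.diag(s)
loop = np.array([
    q @ S.dot(d) / (np.linalg.norm(q) * np.linalg.norm(S.dot(d)))
    for d in vt.T
])

print(np.allclose(query_sims(q, s, vt), loop))
```

Multiplying the rows of `vt.T` by `s` avoids building the diagonal matrix and one matrix-vector product per document, which matters when the number of documents is large.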

## Exercise 4.7: Document-document similarity

In [21]:
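# Cosine similarity between two documents in concept space:
# both document vectors are scaled by the singular values before being compared
def compute_doc_sim(d1, d2):
    d1S = S.dot(d1)
    d2S = S.dot(d2)
    return d1S.dot(d2S) / (np.linalg.norm(d1S) * np.linalg.norm(d2S))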

In [22]:
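# Index of the reference course (COM-308, Internet analytics)
doc = documents.index('COM-308')

# Rows of v are the document vectors in concept space
v = vt.T

# Similarity of the reference course with every document (including itself)
sims = [compute_doc_sim(v[doc], d) for d in v]
# Keep positive similarities only, keyed by course name, sorted in decreasing order
sims = {dat[documents[i]]['name']: x for i, x in enumerate(sims) if x > 0}
sims = sorted(sims.items(), key=operator.itemgetter(1), reverse=True)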

In [23]:
print("Top classes similar to Internet Analytics:")
sims[:6]

Top classes similar to Internet Analytics:


[('Internet analytics', 1.0000000000000002),
 ('A Network Tour of Data Science', 0.4474673072882506),
 ('Applied data analysis', 0.39831253596762267),
 ('Computational Social Media', 0.34260955208146177),
 ('Database systems', 0.3333640338430118),
 ('Digital education & learning analytics', 0.3124855139952114)]
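
The reference course appears first with a score of 1.0000000000000002; this is floating-point rounding, since the cosine similarity of a vector with itself is exactly 1. A possible way to handle both points, sketched below on random stand-in data, is to compute all pairwise similarities at once, clip them to [-1, 1] and skip the query document when ranking. The sizes, the data and the name `most_similar` are assumptions made for illustration.

```python
import numpy as np

# Hypothetical stand-ins: 4 concepts, 6 documents
rng = np.random.default_rng(3)
k, n_docs = 4, 6
s = rng.random(k) + 1.0
vt = rng.standard_normal((k, n_docs))

# Scale each document vector by the singular values, then normalize to unit length
docs = vt.T * s
docs /= np.linalg.norm(docs, axis=1, keepdims=True)

# All pairwise cosine similarities; clipping removes rounding overshoot beyond 1
sim = np.clip(docs @ docs.T, -1.0, 1.0)


def most_similar(doc, n=3):
    """Indices of the n documents most similar to doc, excluding doc itself."""
    ranked = np.argsort(sim[doc])[::-1]
    return [int(d) for d in ranked if d != doc][:n]


print(most_similar(0))
```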