# Text 2: Latent semantic indexing
**Internet Analytics - Lab 4**

---

**Group:** *B*

**Names:**

* *Vincenzo Bazzucchi*
* *Amaury Combes*
* *Alexis Montavon*

---

#### Instructions

*This is a template for part 2 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of codes in the notebook, you can also embed long python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [1]:
import pickle
import numpy as np
from scipy.sparse.linalg import svds
from scipy.sparse import csr_matrix

## Exercise 4.4: Latent semantic indexing

In [2]:
loader = np.load('tfidf.npz')
tdm = csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                 shape=loader['shape'], dtype=float)  # np.float was removed in NumPy 1.24

with open('terms.txt', 'r') as f:
    terms = [t.strip('\n') for t in f]

with open('courses.txt', 'r') as f:
    courses = [c.strip('\n') for c in f]

In [3]:
U, s, Vt = svds(tdm, k=300)

In [4]:
s[-20:][::-1]

array([ 59.01271033,  42.69280749,  36.27004754,  33.81421925,
        33.18910907,  32.6155604 ,  31.56403462,  30.69270191,
        30.22463554,  29.62774003,  29.39274378,  29.12005698,
        28.82328777,  28.3323923 ,  28.02012071,  27.46953019,
        27.353118  ,  27.16420203,  27.00796738,  26.66922338])

$U$ has one row per term and one column per topic. Its columns are the left singular vectors, one per topic. Row $i$ gives the coordinates of term $i$ in the 300-dimensional topic space, and element $(i,j)$ tells us to what extent term $i$ is relevant to topic $j$.

$V$ has one row per course (document) and one column per topic. Its columns are the right singular vectors, one per topic. Row $i$ gives the coordinates of course $i$ in the space of the 300 topics, and element $(i,j)$ tells us to what extent topic $j$ is present in course $i$.

Finally `np.diag(s)` is a diagonal matrix which has one element per topic. The element $i,i$ of the matrix gives the importance of topic $i$ in the document corpus.
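
As a minimal sketch of the shapes described above, the snippet below runs `svds` on a small random sparse matrix. The matrix and the helper `sorted_svds` are hypothetical and only stand in for `tdm`. The helper reorders the triplets by decreasing singular value with `argsort`, so the result does not depend on the order in which the solver returns them.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

def sorted_svds(matrix, k):
    """Truncated SVD with the k triplets reordered by decreasing singular value."""
    U, s, Vt = svds(matrix, k=k)
    order = np.argsort(s)[::-1]          # indices of the singular values, largest first
    return U[:, order], s[order], Vt[order, :]

# Hypothetical 50-term x 30-document matrix with roughly 20% non-zero entries.
rng = np.random.default_rng(0)
dense_demo = rng.random((50, 30)) * (rng.random((50, 30)) < 0.2)
tdm_demo = csr_matrix(dense_demo)

U_demo, s_demo, Vt_demo = sorted_svds(tdm_demo, k=5)
print(U_demo.shape)   # one row per term, one column per topic
print(Vt_demo.shape)  # one row per topic, one column per document
print(s_demo)         # topic weights, largest first
```

With this ordering, column `j` of `U_demo`, entry `j` of `s_demo` and row `j` of `Vt_demo` all refer to the same topic.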

## Exercise 4.5: Topic extraction

In [5]:
"""
Sort vectors and values by decreasing order and only keep the first 10.
"""
top10left = np.fliplr(U)[:,:10]
top10right = np.flipud(Vt).T[:,:10]

In [6]:
top10right.shape

(854, 10)

In [7]:
def keep_first(list_of_tuples):
    """Only keep the first element of each tuple in a list of tuples."""
    return [t[0] for t in list_of_tuples]
topics = {}
for topic in range(10):
    top_words = sorted([(terms[i], top10left[i, topic]) for i in range(top10left.shape[0])],
                       key=lambda t: t[1],
                       reverse=True)[:10]
    top_courses = sorted([(courses[i], top10right[i, topic]) for i in range(top10right.shape[0])],
                         key=lambda t: t[1],
                         reverse=True)[:10]
    topics[topic] = {'terms': keep_first(top_words), 'courses': keep_first(top_courses)}
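
The per-topic ranking in the loop above could also be written with `np.argsort`. The following is only a sketch: `top_items` is a hypothetical helper, and the vocabulary and weights are made up for illustration.

```python
import numpy as np

def top_items(names, weights, n=10):
    """Return the n names with the largest weights, largest first."""
    order = np.argsort(weights)[::-1][:n]   # indices sorted by decreasing weight
    return [names[i] for i in order]

# Hypothetical vocabulary and one topic column, for illustration only.
toy_terms = ['markov', 'chain', 'protein', 'cell']
toy_topic = np.array([0.7, 0.6, -0.1, 0.05])
print(top_items(toy_terms, toy_topic, n=2))
```

Singular vectors are only defined up to a sign, so entries with a large negative weight may be just as characteristic of a topic. Ranking on `np.abs(weights)` is one possible way to account for that.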

In [8]:
from pprint import pprint
pprint(topics)

{0: {'courses': ['Innovation management',
                 'Nanoelectronics',
                 'Eco-friendly production and process intensification',
                 'Nonlinear Optics',
                 'Machine Learning for Engineers',
                 'Biomechanics of the musculoskeletal system',
                 'Production management',
                 'Current Topics in Chemical Biology 2',
                 'Atmospheric Boundary Layer',
                 'Investments'],
     'terms': ['transmits',
               'redistribution',
               'metamorphism',
               'pack',
               'vegetation',
               'avalanches',
               'genesis',
               'sectionexplain',
               'saphirre',
               'oscillationknow']},
 1: {'courses': ['Difficult Double Double Histories',
                 'Technology & innovation strategy',
                 'Industrial automation',
                 'Product lifecycle management - concepts methods and tools'

**Labels:**

0) Environmental science

1) Lab/Experiment

2) Finance

3) Mechanical engineering

4) Chemistry

5) Electronics

6) Life Sciences

7) Bio-engineering

8) Micro engineering

9) Physics

## Exercise 4.6: Document similarity search in concept-space

In [9]:
term_directory = {terms[i]: i for i in range(len(terms))}
S = np.diag(s)

def sim(w_indexes, j):
    """
    Computes the similarity between the terms with *indexes* w_indexes and
    the document (course) with *index* j
    
    w_indexes -- The list of indexes of the terms
    j -- The index of the document
    """
    temp = 0
    vj = Vt[:,j] #j-th column of V^T <=> j-th row of V
    for i in w_indexes:
        ui = U[i, :] #i-th row of U
        temp += (ui @ S @ vj.T) / (np.linalg.norm(ui) * np.linalg.norm(S @ vj))
    return temp

def search(keywords):
    """
    Finds the indexes of the 5 documents most relevant to the query
    
    keywords -- list of keywords to search
    """
    w_indexes = [term_directory[w] for w in keywords]
    doc_to_sim = [(doc, sim(w_indexes, doc)) for doc in range(Vt.shape[1])] # (course_index, similarity) array
    doc_to_sim.sort(key=lambda t: t[1], reverse=True)
    return [(courses[t[0]], t[1]) for t in doc_to_sim[:5]]
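
The same score can be computed for all documents at once. The sketch below is a vectorized variant of `sim`, assuming the same definition (the sum over query terms of the cosine between $u_i$ and $S v_j$). The names `lsi_scores` and `top_documents` and the random demo matrices are hypothetical.

```python
import numpy as np

def lsi_scores(U, s, Vt, term_indexes):
    """Sum over the query terms of cos(u_i, S v_j), for every document j."""
    Uq = U[term_indexes, :]                    # (q, k): rows of U for the query terms
    SV = s[:, None] * Vt                       # (k, n_docs): column j is S @ v_j
    numerators = Uq @ SV                       # (q, n_docs): u_i^T S v_j
    norms = (np.linalg.norm(Uq, axis=1)[:, None]
             * np.linalg.norm(SV, axis=0)[None, :])
    return (numerators / norms).sum(axis=0)    # one score per document

def top_documents(scores, n=5):
    """Return (document index, score) pairs for the n best scores."""
    order = np.argsort(scores)[::-1][:n]
    return [(int(j), float(scores[j])) for j in order]

# Hypothetical factors: 6 terms, 3 topics, 4 documents.
rng = np.random.default_rng(0)
U_demo = rng.normal(size=(6, 3))
s_demo = np.array([3.0, 2.0, 1.0])
Vt_demo = rng.normal(size=(3, 4))

demo_scores = lsi_scores(U_demo, s_demo, Vt_demo, [0, 2])
print(top_documents(demo_scores, n=2))
```

This replaces the Python loop over documents with two matrix products, which matters when the search is run many times.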

In [10]:
pprint(search('markov chain'.split(' ')))

[('Computational finance', 1.4489563427083438),
 ('Integrated circuits technology', 1.3449336874694562),
 ('Media security', 1.1600939699858928),
 ('Internet analytics', 1.0266601960879465),
 ('Particle-based methods', 0.82656599302150857)]


In [11]:
pprint(search(['facebook']))

[('Martingales in financial mathematics', 0.92341057133120974),
 ('Interdisciplinary / disciplinary project for chemical master',
  0.66106638084504554),
 ('Software engineering', 0.51636209660108612),
 ('Molecular endocrinology', 0.43929443794931894),
 ('Database systems', 0.43285700621746326)]


We can see that the results are very different from the ones we obtained in the first notebook (Ex. 4.3). This is mainly because, in Ex. 4.3, the similarity was calculated with each term treated as independent of the others. With LSI, near-synonymy is taken into account. Hence the results are not simply the courses that contain the searched terms in their descriptions.
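
A toy example may help illustrate this effect. The five terms and four documents below are invented. The sketch uses the dense `np.linalg.svd` instead of `svds` because the matrix is tiny. In the raw matrix, 'car' has weight 0 in the document that only mentions 'automobile' and 'engine'; after keeping the 2 strongest topics, it gets a positive weight because both words co-occur with 'engine'.

```python
import numpy as np

toy_terms = ['car', 'automobile', 'engine', 'flower', 'petal']
# Columns are documents: {car, engine}, {automobile, engine}, {flower, petal} twice.
X = np.array([
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
], dtype=float)

# np.linalg.svd returns the singular values in decreasing order.
U_full, s_full, Vt_full = np.linalg.svd(X, full_matrices=False)
k = 2
X_k = (U_full[:, :k] * s_full[:k]) @ Vt_full[:k, :]   # rank-k approximation of X

car = toy_terms.index('car')
print(X[car, 1])     # raw weight of 'car' in the 'automobile' document
print(X_k[car, 1])   # weight of 'car' in that document after the projection
print(X_k[car, 2])   # weight of 'car' in a 'flower' document after the projection
```

The unrelated 'flower' documents stay at (numerically) zero for 'car', so the projection links terms that share a context without linking everything to everything.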

## Exercise 4.7: Document-document similarity

Each row of matrix $V$ gives the coordinates of a course in the lower-dimensional ($K=300$) topic space and can be interpreted as the "profile" of that course.

To compare two courses we can use the *cosine similarity*:

$$ sim(d_i, d_j) = \frac{v^T_i \cdot v_j}{||v_i|| \cdot ||v_j||}$$

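where $v_k$ is the $k$-th row of matrix $V$, i.e. the $k$-th column of $V^T$. Note that the SVD makes the *columns* of $V$ orthonormal; its rows are in general not unit vectors when $K$ is smaller than the number of courses. The code below nevertheless drops the denominator and ranks courses by the plain dot product, so the values should be read as unnormalized similarity scores rather than true cosines:

$$ sim(d_i, d_j) = v^T_i \cdot v_j $$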
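
As a minimal sketch, the cosine similarity can also be computed with the normalization kept explicit, so that the result does not rely on the rows of $V$ having unit length. The helper `cosine_to_all` and the small `Vt_demo` matrix are hypothetical.

```python
import numpy as np

def cosine_to_all(Vt, i):
    """Cosine similarity between document i and every document (columns of Vt)."""
    V = Vt.T                                  # one row per document
    norms = np.linalg.norm(V, axis=1)         # length of each document profile
    return (V @ V[i]) / (norms * norms[i])

# Hypothetical V^T with 2 topics and 3 documents.
Vt_demo = np.array([[1.0, 0.0, 1.0],
                    [0.0, 1.0, 1.0]])
print(cosine_to_all(Vt_demo, 0))
```

Document 0 has similarity 1 with itself, 0 with the orthogonal document 1, and $1/\sqrt{2}$ with document 2, which lies at 45 degrees to it.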

In [12]:
com_308 = -1
for index, course_name in enumerate(courses):
    if course_name == 'Internet analytics':
        com_308 = index
        break

In [13]:
course_sim = lambda i, j: Vt[:,i].T @ Vt[:, j]
most_similar = [(name, course_sim(com_308, i)) for i, name in enumerate(courses[:len(courses)-1]) if i != com_308]
most_similar.sort(key=lambda t: t[1], reverse=True)
pprint(most_similar[:5])

[('Macrofinance', 0.19602937299465739),
 ('Speech processing', 0.12943128184625616),
 ('Mechanical product design and development', 0.090962756970695766),
 ('Computational finance', 0.063905920088881796),
 ('Early detection of innovation potential', 0.061124624588481162)]
