# Text 2: Latent semantic indexing
**Internet Analytics - Lab 4**

---

**Group:** *Your group letter.*

**Names:**

* *Name 1*
* *Name 2*
* *Name 3*

---

#### Instructions

*This is a template for part 2 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long code cells in the notebook, you can also embed long Python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

Dimensions for a decomposition of rank $K$: $U$ is $M \times K$, $S$ is $K \times K$ and $V$ is $K \times N$.

In [1]:
import pickle
import numpy as np
from scipy.sparse import load_npz
from scipy.sparse.linalg import svds
from utils import load_json


## Exercise 4.4: Latent semantic indexing

In [8]:
data = load_npz('tf_idf.npz')
courses = load_json('data/courses.txt')
index_terms = load_json('dict_index_word.txt')

In [9]:
index_terms = index_terms[0]

In [10]:
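K = 300
U, S, V = svds(data, k=K)

# In this run svds returned the singular values in ascending order,
# so S[-i] is the i-th largest one. Print the 20 largest.
for i in range(1, 21):
    print(S[-i])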

95.08828388714933
92.52086463673743
88.3883485765918
85.20283555746113
83.43562837221057
83.04422329354485
80.08661731300825
79.44790102019806
78.32118326518545
77.80732117558594
77.55624700367134
77.09396702889406
76.76352918962567
76.20642924256359
75.67076932146176
75.39385613492838
75.19340246546489
74.71228252450693
72.67555141052131
72.26136661351383


The singular value $S_i$ represents the strength of topic $i$.

Column $i$ of $U$ gives the weight of topic $i$ on each row of the data, in our case the terms.

Row $i$ of $V$ gives the weight of topic $i$ on each column of the data, in our case the documents.
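
To make the shapes concrete, here is a minimal sketch on a small random sparse matrix standing in for the TF-IDF data. The names `toy`, `U_toy`, `S_toy` and `Vt_toy` are made up for this illustration. The sketch sorts the singular values explicitly instead of relying on the order in which `svds` returns them.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

# Toy stand-in for the TF-IDF matrix: M terms (rows) x N documents (columns).
M, N, K = 50, 20, 5
toy = sparse_random(M, N, density=0.3, format="csr", random_state=0)

U_toy, S_toy, Vt_toy = svds(toy, k=K)

# Reorder so that index 0 is the strongest topic.
order = np.argsort(S_toy)[::-1]
U_toy, S_toy, Vt_toy = U_toy[:, order], S_toy[order], Vt_toy[order, :]

print(U_toy.shape, S_toy.shape, Vt_toy.shape)  # (50, 5) (5,) (5, 20)
```

After this reordering, topic `i` is simply column `i` of `U_toy` and row `i` of `Vt_toy`, which avoids negative indexing such as `S[-i]`.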

## Exercise 4.5: Topic extraction

In [13]:
documents = []   # documents[i]: indices of the 10 top documents for topic i
terms = []       # terms[i]: indices of the 10 top terms for topic i

for i in range(1, 11):
    # Column of U (M x 1): weight of each of the M terms in the i-th strongest topic
    uCol = U[:, -i]
    # Row of V (1 x N): weight of each of the N documents in the i-th strongest topic
    vRow = V[-i, :]

    # Sort by decreasing weight and keep the 10 largest entries
    indexTerm = np.argsort(uCol)[::-1]
    terms.append(indexTerm[0:10])
    indexDoc = np.argsort(vRow)[::-1]
    documents.append(indexDoc[0:10])


        
#Give a label to each of them 
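
One caveat when ranking by the signed entries of a singular vector: the pair $(u_i, v_i)$ is only defined up to a common sign flip, so the "largest" entries can end up at either end of the sorted vector. The sketch below shows one possible convention, which is an assumption and not part of the lab: orient each vector so that its largest-magnitude entry is positive before taking the top entries. The helper names are hypothetical.

```python
import numpy as np

def top_indices(vector, k=10):
    """Return the indices of the k largest entries of `vector`, largest first."""
    return np.argsort(vector)[::-1][:k]

def top_indices_sign_invariant(vector, k=10):
    """Same as top_indices, after flipping the vector so that its
    largest-magnitude entry is positive (an illustrative convention)."""
    pivot = vector[np.argmax(np.abs(vector))]
    oriented = vector if pivot >= 0 else -vector
    return top_indices(oriented, k)

v = np.array([0.1, -0.7, 0.3, -0.2, 0.6])
print(top_indices(v, 2))                  # [4 2]
print(top_indices_sign_invariant(v, 2))   # [1 3]
print(top_indices_sign_invariant(-v, 2))  # [1 3]
```

With this convention the result no longer depends on which sign the solver happened to return. If it were adopted, the same flip would have to be applied to the column of `U` and the matching row of `V` together.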

In [16]:
for i in range(10):
    print( "Topic " + str(i) + ":")
    
    print( "   - Documents (Courses) :")
    for j in documents[i]:
        print("   " + courses[j]['name'])
        
    print("\n")
    print( "   - Terms (words) :")
    for k in terms[i]:
        print("   " + index_terms[str(k)])
    print("\n")
        
#Give a label to each of them 

Topic 0:
   - Documents (Courses) :
   Field Research Project A
   Project 1 (EDIC)
   Field Research Project B
   Project 2 (EDIC)
   De- and re-regulation of Network Industries
   Training Rotation (EDNE)
   PLLs and clock & data recovery
   Studio BA4 (De Vylder & Taillieu)
   Magnetic confinement
   Multidisciplinary organization of medtechs/biotechs


   - Terms (words) :
   edmt administration
   contact edmt
   administration enrollment
   ic laboratory
   semester ic
   assignment link
   syllabus including
   link note
   note taught
   taught regulation


Topic 1:
   - Documents (Courses) :
   Lab immersion III
   Lab immersion I
   Lab immersion II
   Semester project in Bioengineering
   Project in bioengineering and biosciences
   Project in neuroprosthetics
   Lab immersion in industry A
   Lab immersion academic (outside EPFL) A
   Lab immersion academic (outside EPFL) B
   Lab immersion in industry B


   - Terms (words) :
   experimentation data
   wetlab computational

TODO <br>
Topic 0: Administration IC <br>
Topic 1: Chemistry Lab <br>
Topic 2: ? <br>
Topic 3: same as Topic 1 <br>
Topic 4: Machine Learning <br>
Topic 5: same as Topic 4 <br>
Topic 6: ? <br>
Topic 7: Classification <br>
Topic 8: ? <br>
Topic 9: ?

## Exercise 4.6: Document similarity search in concept-space

For the Facebook query, six terms in the vocabulary contain "facebook", so we
average their term vectors (the corresponding rows of the matrix U) and use
this mean vector to compute the similarity.
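
Written out, the score computed by `sim` below for a term $t$ and a document $d$ is a cosine similarity in concept space. The symbols $u_t$ and $v_d$ are introduced here only for illustration and mirror the arguments `termUt` and `docuVd`:

$$
\mathrm{sim}(t, d) = \frac{u_t^\top (S\, v_d)}{\lVert u_t \rVert \, \lVert S\, v_d \rVert}
$$

where $u_t$ is row $t$ of $U$, $v_d$ is column $d$ of $V$, and $S$ is the diagonal matrix of singular values. For the Facebook query, $u_t$ is replaced by the mean of the six matching rows of $U$.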

In [17]:
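def sim(termUt, docuVd):
    """Cosine similarity between a term vector and a document vector scaled by S."""
    # S is the global vector of singular values returned by svds
    prodSV = S * docuVd
    prod = termUt.T @ (prodSV)
    norm1 = np.linalg.norm(termUt)
    norm2 = np.linalg.norm(prodSV)
    return prod / (norm1 * norm2)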

In [18]:
# Indices of the six vocabulary terms that contain "facebook"
facebook_idxs = [22303, 22304, 22305, 54060, 57412, 66176]

# Average the corresponding rows of U to get a single query vector
UtFacebook = U[facebook_idxs, :].mean(axis=0)

In [19]:
# Term index used for the Markov chain query, and its row in U
McIndex = 35957
UtMc = U[McIndex, :]

In [20]:
McSimilitude = []
FbSimilitude = []
for i in range(len(courses)):
    Vd = V[:, i]
    McSimilitude.append(sim(UtMc, Vd))
    FbSimilitude.append(sim(UtFacebook, Vd))
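
The per-document loop above can also be written as a single vectorised computation. The following is a sketch on random toy data, not the notebook's matrices. `sim_all` is a hypothetical helper that takes the singular values and the $K \times N$ matrix as explicit arguments instead of reading the global `S`.

```python
import numpy as np

def sim_all(term_vec, S, Vt):
    """Cosine similarity between one term vector and every document column of Vt."""
    scaled_docs = S[:, None] * Vt          # (K, N): column d is S * v_d
    dots = term_vec @ scaled_docs          # (N,): one dot product per document
    norms = np.linalg.norm(term_vec) * np.linalg.norm(scaled_docs, axis=0)
    return dots / norms

# Toy data to check the vectorised version against the explicit loop.
rng = np.random.default_rng(0)
K, N = 4, 6
S_toy = np.sort(rng.random(K))
Vt_toy = rng.standard_normal((K, N))
u_toy = rng.standard_normal(K)

looped = np.array([
    (u_toy @ (S_toy * Vt_toy[:, d]))
    / (np.linalg.norm(u_toy) * np.linalg.norm(S_toy * Vt_toy[:, d]))
    for d in range(N)
])
assert np.allclose(sim_all(u_toy, S_toy, Vt_toy), looped)
```

On the real data this would amount to something like `sim_all(UtMc, S, V)`, which returns one score per course in a single call.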


In [21]:
top5CoursesMc = np.argsort(McSimilitude)[::-1][:5]
top5CoursesFb = np.argsort(FbSimilitude)[::-1][:5]

In [22]:
print( "Top 5 Courses (similitude with Markov Chain) :")
for j in top5CoursesMc:
    print("   " + courses[j]['name'] + f" ( {McSimilitude[j]})")

print(" =====================================")
print( "Top 5 Courses (similitude with Facebook) :")
for j in top5CoursesFb:
    print("   " + courses[j]['name'] + f" ( {FbSimilitude[j]})")

Top 5 Courses (similitude with Markov Chain) :
   Markov chains and algorithmic applications ( 0.7495848317908491)
   Optimization and simulation ( 0.721219545739311)
   Applied stochastic processes ( 0.7195993713917974)
   Internet analytics ( 0.674083072315756)
   Computational Social Media ( 0.6035083143311233)
Top 5 Courses (similitude with Facebook) :
   Computational Social Media ( 0.9759045495646025)
   Internet analytics ( 0.947551408352123)
   Distributed information systems ( 0.4948218469422557)
   Networks out of control ( 0.43180139257320505)
   Personal interaction studio ( 0.407806718554651)


In [None]:
# TODO Compare with Part1

## Exercise 4.7: Document-document similarity

In [23]:
def simDoc(doc1,doc2):
    dot_doc = doc1.T @ doc2
    norm1 = np.linalg.norm(doc1)
    norm2 = np.linalg.norm(doc2)
    return dot_doc / (norm1 * norm2)

In [24]:
# Column index of the Internet Analytics course (COM-308) in V
COM308index = 43
COM308Vector = V[:, COM308index]

In [25]:
COM308Similitude = []
for i in range(len(courses)):
    Vd = V[:, i]
    COM308Similitude.append(simDoc(COM308Vector, Vd))

In [26]:
# The most similar course is Internet Analytics itself, so start from rank 1 and not 0 to skip it
top5CoursesCOM308 = np.argsort(COM308Similitude)[::-1][1:6]  
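
Dropping the first result assumes the query course is always ranked first, which may not hold if another course has an identical vector. A possible alternative, sketched here on toy data with a hypothetical helper `most_similar_docs`, is to exclude the query by its index. The sketch assumes no document column is all zeros.

```python
import numpy as np

def most_similar_docs(Vt, doc_idx, k=5):
    """Indices and cosine similarities of the k documents closest to doc_idx."""
    # Normalise every document column to unit length (assumes non-zero columns).
    unit = Vt / np.linalg.norm(Vt, axis=0, keepdims=True)
    sims = unit.T @ unit[:, doc_idx]
    sims[doc_idx] = -np.inf            # exclude the query document itself
    top = np.argsort(sims)[::-1][:k]
    return top, sims[top]

# Four toy documents in a 2-dimensional concept space (one column each).
Vt_toy = np.array([[1.0, 1.0, 0.0, 0.9],
                   [0.0, 0.1, 1.0, 0.5]])
top, scores = most_similar_docs(Vt_toy, doc_idx=0, k=2)
print(top)  # [1 3]
```

Like `simDoc`, this compares the raw columns of the matrix without scaling them by the singular values.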

In [28]:
print( "Top 5 Courses similar to Internet Analytics course :")
for j in top5CoursesCOM308:
    print("   " + courses[j]['name'] + f" ( {COM308Similitude[j]})")

Top 5 Courses similar to Internet Analytics course :
   Computational Social Media ( 0.9317045873618645)
   Distributed information systems ( 0.5436517067832451)
   Networks out of control ( 0.4478851321391735)
   Personal interaction studio ( 0.3959728685073792)
   Graph theory ( 0.3504919600926692)
