# Latent Semantic Analysis

Latent Semantic Analysis (LSA) is a text analysis method that identifies
concepts in a corpus as groups of related words.
In this example, we compute document representations in the concept space
and then score the documents against a query using cosine similarity in that space.

| Description | See [Introduction to Algorithmic Marketing](https://algorithmicweb.wordpress.com/) book |
|--|:--|
| Dataset | Synthetic |
| Libs | SymPy, NumPy |
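
The computation summarized above can be written compactly as follows. This is a sketch of the standard LSA formulation, stated here as an assumption about notation: $A$ is the term-document matrix, $k$ is the number of retained concepts, and $U_k$, $\Sigma_k$, $V_k^T$ correspond to the variables `Ut`, `St`, and `Vt` in the code below. The symbols $\hat{q}$ and $\hat{d}_j$ are introduced only for this illustration.

$$
A \approx A_k = U_k \Sigma_k V_k^T
$$

$$
\hat{q} = q^T U_k \Sigma_k^{-1}
$$

$$
\mathrm{score}(q, d_j) = \frac{\hat{q} \cdot \hat{d}_j}{\lVert \hat{q} \rVert \, \lVert \hat{d}_j \rVert}
$$

Here $q$ is the term vector of the query and $\hat{d}_j$ is the $j$-th column of $V_k^T$, that is, the representation of document $j$ in the concept space.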

In [2]:
%matplotlib inline
import sympy as sy
import numpy as np
import matplotlib.pyplot as plt
from itertools import chain 
from tabulate import tabulate
from collections import Counter

def tabprint(msg, A):
    print(msg)
    print(tabulate(A, tablefmt="fancy_grid"))

In [3]:
docs = [
"chicago chocolate retro candies made with love",
"chocolate sweets and candies collection with mini love hearts",
"retro sweets from chicago for chocolate lovers"]

In [4]:
# Basic analyzer: 
# - split documents into words
# - remove stop words
# - apply a simple stemmer
analyzer = {
    "with": None,
    "for": None,
    "and": None,
    "from": None,
    "lovers": "love",
    "hearts": "heart"
}
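# Split each document on whitespace, map every word through the lookup table
# (unknown words are kept unchanged), and drop the stop words mapped to None
bag_of_words_docs = [list(filter(None, [analyzer.get(word, word) for word in d.split()])) for d in docs]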
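
The lookup table above combines stop-word removal and stemming in one dictionary. A minimal sketch of the same steps as a reusable function is shown below; the names `STOP_WORDS`, `STEMS`, `analyze`, and `sample_docs` are hypothetical and are not part of the original notebook.

```python
# Hypothetical refactoring of the lookup-table analyzer into a function.
STOP_WORDS = {"with", "for", "and", "from"}
STEMS = {"lovers": "love", "hearts": "heart"}

def analyze(text):
    """Split on whitespace, drop stop words, and map known inflections to stems."""
    return [STEMS.get(word, word) for word in text.split() if word not in STOP_WORDS]

sample_docs = [
    "chicago chocolate retro candies made with love",
    "chocolate sweets and candies collection with mini love hearts",
    "retro sweets from chicago for chocolate lovers",
]
tokens = [analyze(d) for d in sample_docs]
print(tokens[2])  # ['retro', 'sweets', 'chicago', 'chocolate', 'love']
```

Separating the stop-word set from the stem map makes each step easier to extend, for example by swapping in a real stemmer later.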

In [6]:
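# Create the term-document frequency matrix (rows: terms, columns: documents)
# Note: set() ordering is arbitrary, so the row order can differ between runs
unique_words = list(set(chain.from_iterable(bag_of_words_docs)))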
word_freq = [Counter(d) for d in bag_of_words_docs]
A = np.array([[freq.get(word, 0) for freq in word_freq] for word in unique_words])
for i, word in enumerate(unique_words):
    print("%10s %s" % (word, str(A[i])))  

   candies [1 1 0]
collection [0 1 0]
 chocolate [1 1 1]
     heart [0 1 0]
   chicago [1 0 1]
    sweets [0 1 1]
      made [1 0 0]
     retro [1 0 1]
      love [1 1 1]
      mini [0 1 0]
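
Because the vocabulary above is built from a `set`, the row order of the matrix is not guaranteed to be the same on every run. A minimal sketch of a reproducible variant, assuming alphabetical ordering is acceptable, is shown below. The names `tokenized`, `vocabulary`, and `term_doc` are illustrative only.

```python
from collections import Counter

import numpy as np

# Tokenized documents, as produced by the analyzer step
tokenized = [
    ["chicago", "chocolate", "retro", "candies", "made", "love"],
    ["chocolate", "sweets", "candies", "collection", "mini", "love", "heart"],
    ["retro", "sweets", "chicago", "chocolate", "love"],
]

# Sorting the vocabulary fixes the row order across runs
vocabulary = sorted({word for doc in tokenized for word in doc})
counts = [Counter(doc) for doc in tokenized]

# Rows are terms, columns are documents; a Counter returns 0 for missing words
term_doc = np.array([[c[word] for c in counts] for word in vocabulary])

for word, row in zip(vocabulary, term_doc):
    print(f"{word:>10} {row}")
```

The matrix holds the same counts as the one printed above, only with the rows in alphabetical order.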


In [10]:
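# Perform a truncated SVD of the term-document matrix.
# Note: np.linalg.svd returns the right singular vectors already transposed,
# so each row of V corresponds to a concept and each column to a document.
U, s, V = np.linalg.svd(A, full_matrices=False)
truncate_rank = 2                  # number of concepts to keep
Ut = U[:, 0:truncate_rank]         # terms in the concept space
Vt = V[0:truncate_rank, :]         # documents in the concept space
St = np.diag(s[0:truncate_rank])   # strength of each retained concept
reconstruction = np.dot(Ut, np.dot(St, Vt))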
tabprint("Ut =", Ut)
tabprint("St =", St)
tabprint("Vt =", Vt)
tabprint("Ut x St x Vt =", np.round(reconstruction))

Ut =
╒═══════════╤════════════╕
│ -0.333888 │ -0.148947  │
├───────────┼────────────┤
│ -0.167762 │ -0.4059    │
├───────────┼────────────┤
│ -0.485799 │  0.0183087 │
├───────────┼────────────┤
│ -0.167762 │ -0.4059    │
├───────────┼────────────┤
│ -0.318037 │  0.424208  │
├───────────┼────────────┤
│ -0.319674 │ -0.238644  │
├───────────┼────────────┤
│ -0.166126 │  0.256953  │
├───────────┼────────────┤
│ -0.318037 │  0.424208  │
├───────────┼────────────┤
│ -0.485799 │  0.0183087 │
├───────────┼────────────┤
│ -0.167762 │ -0.4059    │
╘═══════════╧════════════╛
St =
╒═════════╤═════════╕
│ 3.56192 │ 0       │
├─────────┼─────────┤
│ 0       │ 1.96588 │
╘═════════╧═════════╛
Vt =
╒═══════════╤═══════════╤═══════════╕
│ -0.591727 │ -0.597555 │ -0.541097 │
├───────────┼───────────┼───────────┤
│  0.505138 │ -0.79795  │  0.328805 │
╘═══════════╧═══════════╧═══════════╛
Ut x St x Vt =
╒════╤════╤═══╕
│  1 │  1 │ 1 │
├────┼────┼───┤
│ -0 │  1 │ 0 │
├────┼────┼───┤
│  1 │  1 │ 1 │
├────┼─
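
To get a sense of how much information the rank-2 truncation discards, the sketch below measures the reconstruction error directly. It assumes the same term-document matrix as above, with rows in the order printed earlier; the names `A_k`, `frobenius_error`, and `retained` are illustrative only.

```python
import numpy as np

# Term-document matrix from the example (rows: terms, columns: documents)
A = np.array([
    [1, 1, 0],  # candies
    [0, 1, 0],  # collection
    [1, 1, 1],  # chocolate
    [0, 1, 0],  # heart
    [1, 0, 1],  # chicago
    [0, 1, 1],  # sweets
    [1, 0, 0],  # made
    [1, 0, 1],  # retro
    [1, 1, 1],  # love
    [0, 1, 0],  # mini
], dtype=float)

U, s, Vh = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vh[:k, :]

# Frobenius norm of the residual; for a rank-k truncation this equals the
# root of the sum of squares of the discarded singular values
frobenius_error = np.linalg.norm(A - A_k)

# Share of the squared singular values kept by the first k concepts
retained = np.sum(s[:k] ** 2) / np.sum(s ** 2)

print("singular values:", s)
print("reconstruction error:", frobenius_error)
print("retained share:", retained)
```

Since this matrix has only three singular values, the error of the rank-2 approximation equals the single discarded singular value.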

In [8]:
# Project a query to the concept space and score documents
query = "chicago"
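# One-hot term vector for the query (same row order as unique_words)
q = [int(query == word) for word in unique_words]
# Fold the query into the concept space: q^T * Ut * St^-1
qs = np.dot(q, np.dot(Ut, np.linalg.inv(St)))

def score(query_vec, doc_vec):
    # Cosine similarity between the query and a document in the concept space
    return np.dot(query_vec, doc_vec) / (np.linalg.norm(query_vec) * np.linalg.norm(doc_vec))

# Each column of Vt is the concept-space representation of one document
for d in range(len(docs)):
    print("Document %s score: %s" % (d, score(qs, Vt[:, d])))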

Document 0 score: 0.890730150933
Document 1 score: -0.51043666768
Document 2 score: 0.806592806364
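
For comparison, the sketch below scores the same query both in the raw term space and in the concept space. It is a self-contained illustration under the assumption that cosine similarity is used in both spaces; the helper `cosine` and the names `terms`, `raw_scores`, and `lsa_scores` are hypothetical.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom > 0 else 0.0

terms = ["candies", "collection", "chocolate", "heart", "chicago",
         "sweets", "made", "retro", "love", "mini"]
A = np.array([
    [1, 1, 0], [0, 1, 0], [1, 1, 1], [0, 1, 0], [1, 0, 1],
    [0, 1, 1], [1, 0, 0], [1, 0, 1], [1, 1, 1], [0, 1, 0],
], dtype=float)

# One-hot query vector for the single term "chicago"
q = np.array([float(t == "chicago") for t in terms])

# Fold the query into a 2-dimensional concept space
U, s, Vh = np.linalg.svd(A, full_matrices=False)
k = 2
q_concept = q @ U[:, :k] @ np.diag(1.0 / s[:k])

n_docs = A.shape[1]
raw_scores = [cosine(q, A[:, j]) for j in range(n_docs)]
lsa_scores = [cosine(q_concept, Vh[:k, j]) for j in range(n_docs)]

for j in range(n_docs):
    print(f"Document {j}: raw = {raw_scores[j]:.4f}, LSA = {lsa_scores[j]:.4f}")
```

In the raw term space, document 1 scores exactly zero because it does not contain the query term, and documents 0 and 2 differ only through their lengths. The concept-space scores reproduce the values printed above; they do not depend on the sign convention that the SVD routine chooses for the singular vectors.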
