# LSA (M) Notes

## Distances
The code below computes the Euclidean and cosine distances between terms in a term-document matrix.

### Euclidean Distance
Euclidean distance uses a generalized form of the Pythagorean theorem to measure how far apart (and therefore how dissimilar) terms are in a term-document matrix. This relies on conceptually mapping the terms in a corpus onto a vector space, so that each term has the properties of a vector (direction and magnitude, or location). Geometric principles can then be used to discover underlying linguistic patterns in corpora.
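As a minimal sketch of the idea, the distance between two term vectors $p$ and $q$ is the square root of the summed squared differences across documents, $d(p, q) = \sqrt{\sum_i (p_i - q_i)^2}$. The helper name `euclidean` and the counts below are made up for illustration and are not taken from the Shakespeare data.

```python
from math import sqrt, fsum

def euclidean(p, q):
    """Straight-line distance between two equal-length count vectors."""
    if len(p) != len(q):
        raise ValueError("vectors must cover the same documents")
    # fsum is used instead of the built-in sum, which the notebook
    # shadows with numpy's sum
    return sqrt(fsum((a - b) ** 2 for a, b in zip(p, q)))

# Hypothetical counts of two terms across three documents
term_a = [3, 0, 4]
term_b = [0, 0, 0]

print(euclidean(term_a, term_b))  # 5.0, the familiar 3-4-5 right triangle
```

With only two documents this reduces to the ordinary Pythagorean theorem; each extra document adds one more squared term under the root.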

from math import sqrt, pow, log
import pandas as pd
from numpy import sum, linalg
from operator import itemgetter

df = pd.read_csv('shakespeare.csv', delimiter=',', encoding='latin-1')
df2 = df.set_index('Unnamed: 0')

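def euclidDist(df, term):
    termVec = df.loc[str(term)]  # Row of counts for the target term
    distances = {}
    for index, r in df.iterrows():
        dist = 0
        # Sum squared differences; as written, position 0 (the first data column) is not included
        for j in range(1, len(r)):
            if term != index:
                # .iloc makes the positional lookup explicit
                dist = dist + pow(r.iloc[j] - termVec.iloc[j], 2)
        distances[str(index)] = sqrt(dist)  # The target term itself stays at 0
    return distances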

battleDistances = euclidDist(df2, 'battle')
topBattle = sorted(battleDistances.items(), key=itemgetter(1))[1:10]
print("The words that are most similar to Battle in the Shakespeare Corpus:")
for key, value in topBattle:
    print(key + " " + str(value))

The words that are most similar to Battle in the Shakespeare Corpus:
march 17.832554500127006
army 18.708286933869708
field 19.72308292331602
throne 19.974984355438178
victory 20.0
grant 20.024984394500787
mighty 20.149441679609886
courage 20.346989949375804
yield 20.808652046684813


The Euclidean distances above can be understood as the straight-line distance between the vector for "battle" and the vector for each other word, measured across the documents in the corpus; the smaller the distance, the more similar the two words' distributions.

### Cosine Distance
The problem with Euclidean distance is that it is a simple distance between two vectors that does not account for their differing magnitudes (lengths). As a result, Euclidean distances involving underrepresented or overrepresented terms can distort the apparent strength of the relationships between terms: unrelated terms can end up very close together simply because both have short vectors.
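A toy example may make the problem concrete. The sketch below assumes cosine distance is defined as one minus the cosine similarity, which is a common convention but not necessarily the exact definition used elsewhere in these notes. The function names and counts are hypothetical.

```python
from math import sqrt, fsum

def euclidean(p, q):
    """Straight-line distance between two count vectors."""
    return sqrt(fsum((a - b) ** 2 for a, b in zip(p, q)))

def cosine_distance(p, q):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = fsum(a * b for a, b in zip(p, q))
    norm_p = sqrt(fsum(a * a for a in p))
    norm_q = sqrt(fsum(b * b for b in q))
    return 1 - dot / (norm_p * norm_q)

# Hypothetical counts across two documents
rare_a = [1, 0]    # rare term, appears only in document 1
rare_b = [0, 1]    # rare term, appears only in document 2
common = [10, 0]   # frequent term distributed like rare_a

print(euclidean(rare_a, rare_b))        # about 1.41: the unrelated terms look close
print(euclidean(rare_a, common))        # 9.0: the related terms look far apart
print(cosine_distance(rare_a, rare_b))  # 1.0: no documents in common
print(cosine_distance(rare_a, common))  # 0.0: same direction, different magnitude
```

The two rare terms never share a document, yet their short vectors sit close together in Euclidean terms. Comparing directions rather than positions removes the effect of vector magnitude.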