# Euclidean Distance

In [None]:
import pandas as pd
path = 'https://raw.githubusercontent.com/organisciak/Text-Mining-Course/master/data/contemporary_books/'
data = pd.read_csv(path + 'contemporary.csv', encoding='utf-8').set_index('book')
info = pd.read_csv(path + 'contemporary_labels.csv', encoding='utf-8')

Treat each book as a point in Euclidean space, with one coordinate per word in the vocabulary. For example, $mdp.39015005028686=\{aback=1, abagail=12, ..., zoo=5\}$
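As a minimal sketch of what that means, the Euclidean distance between two count vectors $a$ and $b$ is $\sqrt{\sum_i (a_i - b_i)^2}$. The two toy books and their three-word vocabulary below are invented for illustration and are not part of the course data.

```python
import numpy as np

# Hypothetical word counts for two toy books over the vocabulary
# [aback, abandon, zoo]; these numbers are made up for illustration.
toy_a = np.array([1.0, 2.0, 5.0])
toy_b = np.array([1.0, 0.0, 1.0])

# Subtract coordinate by coordinate, square, sum, and take the root.
diff = toy_a - toy_b
toy_distance = np.sqrt(np.sum(diff ** 2))
print(toy_distance)
```

The real books work the same way, only with thousands of coordinates instead of three.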

Here's our data:

In [None]:
data

book,aback,abagail,abandon,abandoned,abandoning,abandonment,abducted,aberrations,abilities,ability,...,zipper,zippered,zippers,zipping,zombie,zombies,zone,zones,zoning,zoo
mdp.39015005028686,1.0,12.0,2.0,6.0,0.0,0.0,0.0,0.0,0.0,8.0,...,5.0,1.0,0.0,4.0,1.0,0.0,32.0,0.0,0.0,5.0
mdp.39015010763418,1.0,0.0,0.0,6.0,1.0,0.0,0.0,0.0,0.0,4.0,...,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
mdp.39015027242315,4.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,3.0,1.0,...,1.0,0.0,0.0,0.0,1.0,0.0,2.0,1.0,0.0,0.0
mdp.39015029244657,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,1.0,1.0
mdp.39015031703609,0.0,0.0,0.0,7.0,1.0,0.0,0.0,0.0,1.0,4.0,...,0.0,0.0,1.0,2.0,0.0,0.0,1.0,0.0,1.0,3.0
mdp.39015038148048,0.0,0.0,0.0,4.0,1.0,0.0,0.0,0.0,1.0,3.0,...,0.0,0.0,1.0,1.0,0.0,0.0,4.0,0.0,0.0,0.0
mdp.39015040702071,7.0,0.0,1.0,1.0,2.0,0.0,2.0,0.0,2.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
mdp.39015043780249,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0
mdp.39015043798936,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,1.0,7.0,...,1.0,0.0,0.0,0.0,1.0,0.0,33.0,0.0,2.0,0.0
mdp.39015046381565,0.0,0.0,0.0,6.0,0.0,0.0,0.0,0.0,1.0,8.0,...,1.0,0.0,0.0,1.0,0.0,0.0,2.0,0.0,0.0,0.0


Here's what just the values look like for the first book, from $aback$ to $zoo$:

In [None]:
book1 = data.iloc[0].values
book1

array([  1.,  12.,   2., ...,   0.,   0.,   5.])

SciPy offers a set of distance functions in [scipy.spatial.distance](https://docs.scipy.org/doc/scipy-0.14.0/reference/spatial.distance.html), which can stand in for similarity: the smaller the distance between two books, the more similar they are.

We covered Euclidean distance in class, but the module has many other options, such as cosine distance, Jaccard dissimilarity, and correlation distance.
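A rough sketch of how a few of those alternatives are called, assuming the same `scipy.spatial.distance` module. The toy vectors are invented for illustration. Jaccard is defined on presence or absence, so the counts are converted to booleans first.

```python
import numpy as np
from scipy.spatial.distance import cosine, correlation, jaccard

# Made-up word counts for two toy documents (not from the course data).
u = np.array([2.0, 0.0, 1.0, 3.0])
v = np.array([1.0, 1.0, 0.0, 3.0])

# Cosine distance: 1 minus the cosine of the angle between the vectors.
d_cosine = cosine(u, v)

# Correlation distance: 1 minus the Pearson correlation of the two vectors.
d_correlation = correlation(u, v)

# Jaccard dissimilarity compares which words occur at all, ignoring counts.
d_jaccard = jaccard(u > 0, v > 0)

print(d_cosine, d_correlation, d_jaccard)
```

Each function takes two vectors and returns a single number, so any of them could be swapped in wherever `euclidean` is used in this notebook.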

Here's Euclidean distance. As a sanity check, confirm that the distance between identical books is zero:

In [None]:
from scipy.spatial.distance import euclidean
euclidean(book1, book1)

0.0

Good. How about the distance between book1 (*The Stand*) and book2 (*Lady Oracle*)?

In [None]:
book2 = data.iloc[1].values

In [None]:
euclidean(book1, book2)

3344.1303802334023

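On its own, this number doesn't mean much. SciPy has a pairwise function, `pdist`, that computes the distance between every pair of books: book 1 with every other book, book 2 with every book after it, and so on. It returns those distances as one flat (condensed) array, which `squareform` expands into a square matrix.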
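Before running it on all the books, here is a small sketch of the two result shapes on made-up data. The three toy "books" below are illustrative only.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Three made-up "books" with two word counts each (illustrative only).
toy = np.array([[0.0, 0.0],
                [3.0, 4.0],
                [6.0, 8.0]])

# pdist returns one distance per unordered pair: (0,1), (0,2), (1,2).
condensed = pdist(toy, 'euclidean')

# squareform expands that vector into a symmetric matrix with a zero diagonal.
square = squareform(condensed)
print(condensed)
print(square)
```

For $n$ rows the condensed array holds $n(n-1)/2$ distances, while the square form repeats each distance on both sides of the diagonal.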

In [None]:
from scipy.spatial.distance import pdist, squareform
Y = pdist(data.values, 'euclidean')
squareform(Y)

array([[    0.        ,  3344.13038023,  3387.53169727,  3601.69668351,
         3728.66759044,  2275.62980293,  3096.18765581,  3976.7167362 ,
         2255.92397922,  3375.075851  ,  3728.80865693,  3629.98581264,
         3575.78648691,  2665.69653187,  3047.49602133,  3719.22263383,
         3681.11070195,  4436.66304783,  2894.71708462,  3828.3670148 ,
         4061.84773225,  2162.55913214,  3873.96515214,  4013.12571445,
         3798.22445361,  3693.81550703,  3867.72336136,  3683.02986684,
         2761.75053182,  3798.7786195 ,  3198.80102538],
       [ 3344.13038023,     0.        ,  1765.09178232,  1054.04601418,
         1765.91053001,  1824.72710288,  1002.60560541,  1141.33343069,
         1544.56822446,  1452.47616159,  1765.83634576,   973.80952963,
         1028.2417031 ,  1197.54832888,   953.15371268,   950.88853185,
          943.19139097,  1355.99225662,   974.39673645,   955.78867957,
         1139.2440476 ,  1756.66445288,  1128.40241049,  1071.47935118,
       

That's a lot of numbers. A DataFrame is easier to read, so convert the matrix to one and use the book titles as both the row and column labels (this assumes the rows of `info` are in the same order as the rows of `data`):

In [None]:
pd.DataFrame(squareform(Y), columns=info['title'], index=info['title'])

title,The stand,Lady oracle;,The robber bride,The pelican brief,The rainmaker,Desperation,Alias Grace,The girl who loved Tom Gordon,Bag of bones,A time to kill,...,Duma Key,The appeal,Carrie,Bodily harm,Cat's eye,Life before man,The king of torts (large print),The dark half,Stephen King's Danse macabre,Cujo
The stand,0.0,3344.13038,3387.531697,3601.696684,3728.66759,2275.629803,3096.187656,3976.716736,2255.923979,3375.075851,...,2162.559132,3873.965152,4013.125714,3798.224454,3693.815507,3867.723361,3683.029867,2761.750532,3798.77862,3198.801025
Lady oracle;,3344.13038,0.0,1765.091782,1054.046014,1765.91053,1824.727103,1002.605605,1141.333431,1544.568224,1452.476162,...,1756.664453,1128.40241,1071.479351,1312.503333,1431.553352,1071.174122,965.113983,1096.740626,1582.280949,960.913107
The robber bride,3387.531697,1765.091782,0.0,1940.843631,1668.808857,2194.694512,1503.499584,2062.570484,1934.238868,2128.606117,...,2124.674799,2063.406165,2063.059621,1181.15198,1207.652268,1436.837152,1979.846459,1810.928767,2021.72649,1796.967724
The pelican brief,3601.696684,1054.046014,1940.843631,0.0,1432.581237,2023.880431,1463.105943,1139.940788,1810.056353,1192.015939,...,2056.459336,918.614718,1055.932289,1406.299399,1502.418717,1160.745019,856.54422,1292.50648,1613.00713,1093.434497
The rainmaker,3728.66759,1765.91053,1668.808857,1432.581237,0.0,2384.818442,1751.004854,1810.614813,2090.044736,1558.343672,...,2348.235082,1482.473609,1776.679487,1526.194286,1424.086023,1438.054241,1447.027643,1795.152361,1839.011963,1681.716385
Desperation,2275.629803,1824.727103,2194.694512,2023.880431,2384.818442,0.0,1825.976177,2236.273463,1129.255507,2067.315167,...,1261.434501,2288.921799,2293.361507,2284.147543,2225.446921,2247.607395,2124.667974,1275.256445,2364.343461,1614.080853
Alias Grace,3096.187656,1002.605605,1503.499584,1463.105943,1751.004854,1825.976177,0.0,1649.573884,1486.74443,1606.895454,...,1624.78614,1536.130854,1608.373091,1367.826378,1453.3599,1345.15501,1378.53219,1251.337684,1763.78315,1312.43057
The girl who loved Tom Gordon,3976.716736,1141.333431,2062.570484,1139.940788,1810.614813,2236.273463,1649.573884,0.0,2044.295722,1746.324998,...,2364.913529,1027.487226,540.51642,1352.838128,1461.30387,975.84425,1111.287991,1534.200769,1549.513149,1063.73963
Bag of bones,2255.923979,1544.568224,1934.238868,1810.056353,2090.044736,1129.255507,1486.74443,2044.295722,0.0,1820.707555,...,966.626608,2037.792678,2114.754123,2058.56892,1945.422833,1989.592672,1872.812324,1011.041542,2044.775293,1393.074657
A time to kill,3375.075851,1452.476162,2128.606117,1192.015939,1558.343672,2067.315167,1606.895454,1746.324998,1820.707555,0.0,...,2012.535466,1338.993279,1682.908494,1837.253385,1855.588855,1664.343414,1270.294454,1503.701766,1970.953576,1496.159417


For example, *The Stand* is closest to *Duma Key* (about 2,163), *Bag of Bones* (about 2,256), and *Desperation* (about 2,276), all of which are also Stephen King novels.
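Reading the nearest books off a wide table by eye is error-prone. One way to look them up is sketched below, assuming the titles are unique. The small distance table and its `Book A` to `Book C` labels are placeholders, not the course data.

```python
import pandas as pd

# A made-up 3x3 distance table; titles and values are placeholders.
titles = ['Book A', 'Book B', 'Book C']
toy_dist = pd.DataFrame([[0.0, 5.0, 2.0],
                         [5.0, 0.0, 4.0],
                         [2.0, 4.0, 0.0]],
                        index=titles, columns=titles)

# Take one book's row, drop its zero distance to itself, and sort ascending.
nearest = toy_dist.loc['Book A'].drop('Book A').sort_values()
print(nearest)
```

The same pattern should apply to the full distance DataFrame if it is first assigned to a variable.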

As we'll learn in *Lab 8*, this type of comparison has two weaknesses: raw counts grow with document length, so longer books look farther from everything else, and every word is weighted equally even though some words are more informative than others. Applying tf-idf to these values will give more realistic comparisons.
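The document-length problem can be seen in a small sketch with made-up counts. This only normalizes by length, which is a simpler step than tf-idf and is not a substitute for it.

```python
import numpy as np
from scipy.spatial.distance import euclidean

# Made-up counts: the "long" book uses words in the same proportions as the
# short one, but is ten times longer.
short_book = np.array([1.0, 2.0, 5.0])
long_book = short_book * 10

# On raw counts, length alone pushes the two books far apart.
raw_distance = euclidean(short_book, long_book)

# Dividing by each book's total word count gives relative frequencies.
short_freq = short_book / short_book.sum()
long_freq = long_book / long_book.sum()
normalized_distance = euclidean(short_freq, long_freq)

print(raw_distance, normalized_distance)
```

The raw distance is large even though the two toy books use words in identical proportions. After normalizing, the distance is essentially zero.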