# Model - Distance Based

Distance-based metrics are a common way to measure the similarity between two items. They have long been used in information retrieval.

For a better understanding of these metrics, see Ben Frederickson's article [Distance Metrics for Fun and Profit](http://www.benfrederickson.com/distance-metrics/).

## Finding similarity

- Doc2Vec
- Cosine Similarity
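
Cosine similarity compares the direction of two vectors: it is the dot product divided by the product of the vector lengths, `dot(a, b) / (norm(a) * norm(b))`, and it ranges from -1 to 1. A minimal NumPy sketch of that formula follows; the function name `cosine_similarity` and the toy vectors are assumptions made for illustration and are not part of this notebook.

```python
import numpy as np

def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    norm_product = np.linalg.norm(a) * np.linalg.norm(b)
    if norm_product == 0.0:
        # Undefined for a zero vector; returning 0.0 is a convention chosen here
        return 0.0
    return float(np.dot(a, b) / norm_product)

same = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # perpendicular vectors
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])       # opposite directions
print(same, orthogonal, opposite)
```

Because only the direction matters, scaling a vector does not change its score.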

In [1]:
import pandas as pd

In [2]:
stories = pd.read_csv("data/stories.csv", low_memory=False)

In [3]:
stories.head()

| | story | title | score | user |
|---|---|---|---|---|
| 0 | 14274033 | Uber is valued at $70B, you can get it at $999 | 503 | appoets |
| 1 | 14107522 | Jeff Bezos? Annual Letter | 547 | djyaz1200 |
| 2 | 13647190 | India has banned disposable plastic in Delhi | 671 | SimplyUseless |
| 3 | 13309610 | Startup Puts Everything You Need for a Two-Acr... | 511 | kungfudoi |
| 4 | 13860890 | The Uber Bombshell About to Drop | 1089 | dantiberian |


## Creating More features from Title

In [23]:
#! pip install spacy

In [26]:
#! python -m spacy download en

In [4]:
# Let's use spaCy
import spacy
nlp = spacy.load('en')

In [6]:
title0 = nlp(stories.title[0])
title1 = nlp(stories.title[1])

In [7]:
title0, title1 

(Uber is valued at $70B, you can get it at $999, Jeff Bezos? Annual Letter)

In [8]:
len(title0.vector), len(title1.vector)

(384, 384)

In [10]:
title0.similarity(title1)

-8.1985350679048196e-22
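
By default, spaCy builds a document vector by averaging its token vectors, and `similarity` returns the cosine similarity of the two document vectors, so a valid score lies between -1 and 1 and a title compared with itself scores 1. The sketch below reproduces that idea with NumPy alone. The three-dimensional `toy_vectors` lookup and the helper names `title_vector` and `similarity` are invented for illustration. A score that is effectively zero, or one outside that range, may be a sign that the loaded model does not provide usable word vectors.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors, invented for this example
toy_vectors = {
    "uber": np.array([0.9, 0.1, 0.0]),
    "bombshell": np.array([0.2, 0.8, 0.1]),
    "valued": np.array([0.7, 0.2, 0.3]),
    "letter": np.array([0.0, 0.3, 0.9]),
}

def title_vector(title):
    """Average the vectors of the known words in a title."""
    words = [w for w in title.lower().split() if w in toy_vectors]
    if not words:
        return np.zeros(3)
    return np.mean([toy_vectors[w] for w in words], axis=0)

def similarity(title_a, title_b):
    """Cosine similarity between two averaged title vectors."""
    a, b = title_vector(title_a), title_vector(title_b)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

print(similarity("Uber valued", "Uber bombshell"))
print(similarity("Uber valued", "Uber valued"))  # identical titles score 1.0 up to rounding
```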

In [13]:
story_similarity = []

The code below is time-consuming, so we run it only for the first 100 titles.

In [15]:
%%time 

# Parse each title once, then compare every pair of parsed documents
docs = [nlp(title) for title in stories.title[:100]]
for doc_row in docs:
    for doc_column in docs:
        story_sim = doc_row.similarity(doc_column)
        story_similarity.append([doc_row.text, doc_column.text, story_sim])
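
The pairwise comparison above makes 100 × 100 similarity calls. One possible alternative, sketched here under the assumption that the document vectors have already been stacked into a 2-D array, is to scale each row to unit length and take a single matrix product. Random numbers stand in for the real `doc.vector` values, and the names `vectors`, `unit_vectors` and `pairwise` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for np.vstack([doc.vector for doc in docs]): 100 titles, 384 dimensions
vectors = rng.normal(size=(100, 384))

# Scale every row to unit length, guarding against all-zero rows
norms = np.linalg.norm(vectors, axis=1, keepdims=True)
unit_vectors = vectors / np.where(norms == 0, 1.0, norms)

# Entry (i, j) is the cosine similarity between title i and title j
pairwise = unit_vectors @ unit_vectors.T
print(pairwise.shape)
```

The result is symmetric with ones on the diagonal, which makes it a quick sanity check for scores produced any other way.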

In [24]:
story_similarity = pd.DataFrame(story_similarity,
                               columns = ["story1", "story2", "similarity"] )

In [25]:
story_similarity.head()

| | story1 | story2 | similarity |
|---|---|---|---|
| 0 | Uber is valued at $70B, you can get it at $999 | Uber is valued at $70B, you can get it at $999 | 0.0 |
| 1 | Uber is valued at $70B, you can get it at $999 | Jeff Bezos? Annual Letter | -8.198535000000001e-22 |
| 2 | Uber is valued at $70B, you can get it at $999 | India has banned disposable plastic in Delhi | 3.723083e+17 |
| 3 | Uber is valued at $70B, you can get it at $999 | Startup Puts Everything You Need for a Two-Acr... | 0.02046476 |
| 4 | Uber is valued at $70B, you can get it at $999 | The Uber Bombshell About to Drop | 0.0 |


In [53]:
# Reshape the long table into a matrix: rows are story1, columns are story2
similarity_matrix = story_similarity.pivot(index="story1",
                                           columns="story2",
                                           values="similarity")
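
As a small illustration of this long-to-wide reshaping, the sketch below pivots a toy table of pairwise scores. The titles `A` and `B` and their scores are invented. Note that `pivot` raises an error if a (`story1`, `story2`) pair occurs more than once, for instance when two stories share the same title.

```python
import pandas as pd

# Toy long-format table: one row per (story1, story2) pair, scores invented
pairs = pd.DataFrame({
    "story1": ["A", "A", "B", "B"],
    "story2": ["A", "B", "A", "B"],
    "similarity": [1.0, 0.25, 0.25, 1.0],
})

# Rows become story1, columns become story2, cells hold the similarity score
matrix = pairs.pivot(index="story1", columns="story2", values="similarity")
print(matrix)
```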

In [51]:
similarity_matrix.head()

| story1 / story2 | 1Password Travel Mode: Protect your data when crossing borders | 94-year-old Lithium-Ion Battery Inventor Introduces Solid State Battery | ?Startup? asks internship applicant to build their app before phone screen | ?The moon blew up without warning and for no apparent reason? | A crashed advertisement reveals logs of a facial recognition system | A lawsuit over Costco golf balls | A rising sentiment that IBM?s Watson can?t deliver on its promises | AMA: NY AG Schneiderman on net neutrality and protecting our voice in government | Accidentally destroyed production database on first day of a job | Afraid of Makefiles? Don't be | ... | Welcome, ACLU | What Is Ethereum? | What the CIA WikiLeaks Dump Tells Us: Encryption Works | Whistleblower uncovers London police hacking of journalists and protestors | Why F.E.A.R.?s AI is still the best in first-person shooters | Why Should I Start a Startup? | Why Slack is inappropriate for open source communications | Why We Must Fight for the Right to Repair Our Electronics | Why do many math books have so much detail and so little enlightenment? (2010) | YC will hold interviews in Vancouver for founders who can?t get US visas |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1Password Travel Mode: Protect your data when crossing borders | 0.0 | -2.1966e+17 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 2.5531e+17 | -2.9869e+17 | 0.0 | -3.4295e+17 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 94-year-old Lithium-Ion Battery Inventor Introduces Solid State Battery | -2.1966e+17 | -1.5766e+17 | 0.0 | 0.0 | 0.0 | -1.8643e+17 | 0.0 | -2.1692e+17 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ?Startup? asks internship applicant to build their app before phone screen | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | -3.2323e+17 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ?The moon blew up without warning and for no apparent reason? | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -2.8854e+17 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.2432e+17 | 0.0 | 0.0 | 0.0 | 0.0 |
| A crashed advertisement reveals logs of a facial recognition system | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.7176e+17 | -2.5067e+17 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 2.4399e+17 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -2.8151e+17 |
