![](https://cdn-images-1.medium.com/max/800/0*IeVJrm09xE3mrRr6.jpg)
Photo Credit: https://pixabay.com/en/hong-kong-night-light-rail-city-2288999/

# 3 Basic Distance Measurements in Text Mining
In NLP, we often want to measure the similarity between sentences or documents. Text is not like numbers or coordinates: we cannot directly compute the difference between "Apple" and "Orange", but we can calculate a similarity score once the text is converted to a numeric form.

# Why?
We cannot simply subtract "Orange is fruit" from "Apple is fruit", so we have to convert the text to numbers before calculating anything. With a score in hand, we can tell how similar two objects are.

# When?
In my data science work, I have used it for:
- Checking whether 2 articles describe the same news
- Identifying similar documents
- Classifying the category from a product description

# How?
In this article, we will go through 3 basic distance measurements:
- Euclidean Distance
- Cosine Similarity
- Jaccard Similarity

Before any distance measurement, the text has to be tokenized. If you are not familiar with word tokenization, you can visit this [article](https://medium.com/@makcedward/nlp-pipeline-word-tokenization-part-1-4b2b547e6a3).

In [3]:
import pandas as pd
import numpy as np
import nltk
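import sklearn
import sklearn.metrics        # import submodules explicitly so that
import sklearn.preprocessing  # sklearn.metrics / sklearn.preprocessing resolve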

In [38]:
"""
    News headlines taken from
    
    https://www.reuters.com/article/us-musk-tunnel/elon-musks-boring-co-to-build-high-speed-airport-link-in-chicago-idUSKBN1JA224
    http://money.cnn.com/2018/06/14/technology/elon-musk-boring-company-chicago/index.html
    https://www.theverge.com/2018/6/13/17462496/elon-musk-boring-company-approved-tunnel-chicago

"""

news_headline1 = "Elon Musk's Boring Co to build high-speed airport link in Chicago"
news_headline2 = "Elon Musk's Boring Company to build high-speed Chicago airport link"
news_headline3 = "Elon Musk’s Boring Company approved to build high-speed transit between downtown Chicago and O’Hare Airport"
news_headline4 = "Both apple and orange are fruit"

news_headlines = [news_headline1, news_headline2, news_headline3, news_headline4]

# Preprocessing

Tokenize each headline into a list of words.

In [5]:
news_headline1_tokens = nltk.word_tokenize(news_headline1)
news_headline2_tokens = nltk.word_tokenize(news_headline2)
news_headline3_tokens = nltk.word_tokenize(news_headline3)
news_headline4_tokens = nltk.word_tokenize(news_headline4)

for words in [news_headline1_tokens, news_headline2_tokens, news_headline3_tokens, news_headline4_tokens]:
    print('First 7 tokens from news headlines: ', words[:7])

First 7 tokens from news headlines:  ['Elon', 'Musk', "'s", 'Boring', 'Co', 'to', 'build']
First 7 tokens from news headlines:  ['Elon', 'Musk', "'s", 'Boring', 'Company', 'to', 'build']
First 7 tokens from news headlines:  ['Elon', 'Musk', '’', 's', 'Boring', 'Company', 'approved']
First 7 tokens from news headlines:  ['Both', 'apple', 'and', 'orange', 'are', 'fruit']


In [7]:
from numpy import argmax

def transform(headlines):
    tokens = [w for s in headlines for w in s ]
    print()
    print('All Tokens:')
    print(tokens)

    results = []
    label_enc = sklearn.preprocessing.LabelEncoder()
    onehot_enc = sklearn.preprocessing.OneHotEncoder()
    
    encoded_all_tokens = label_enc.fit_transform(list(set(tokens)))
    encoded_all_tokens = encoded_all_tokens.reshape(len(encoded_all_tokens), 1)
    
    onehot_enc.fit(encoded_all_tokens)
    
    for headline_tokens in headlines:
        print()
        print('Original Input:', headline_tokens)
        
        encoded_words = label_enc.transform(headline_tokens)
        print('Encoded by Label Encoder:', encoded_words)
        
        encoded_words = onehot_enc.transform(encoded_words.reshape(len(encoded_words), 1))
        print('Encoded by OneHot Encoder:')
        print(encoded_words)

        results.append(np.sum(encoded_words.toarray(), axis=0))
    
    return results

transformed_results = transform([
    news_headline1_tokens, news_headline2_tokens, news_headline3_tokens, news_headline4_tokens])


All Tokens:
['Elon', 'Musk', "'s", 'Boring', 'Co', 'to', 'build', 'high-speed', 'airport', 'link', 'in', 'Chicago', 'Elon', 'Musk', "'s", 'Boring', 'Company', 'to', 'build', 'high-speed', 'Chicago', 'airport', 'link', 'Elon', 'Musk', '’', 's', 'Boring', 'Company', 'approved', 'to', 'build', 'high-speed', 'transit', 'between', 'downtown', 'Chicago', 'and', 'O', '’', 'Hare', 'Airport', 'Both', 'apple', 'and', 'orange', 'are', 'fruit']

Original Input: ['Elon', 'Musk', "'s", 'Boring', 'Co', 'to', 'build', 'high-speed', 'airport', 'link', 'in', 'Chicago']
Encoded by Label Encoder: [ 7  9  0  2  5 25 17 20 11 22 21  4]
Encoded by OneHot Encoder:
  (0, 7)	1.0
  (1, 9)	1.0
  (2, 0)	1.0
  (3, 2)	1.0
  (4, 5)	1.0
  (5, 25)	1.0
  (6, 17)	1.0
  (7, 20)	1.0
  (8, 11)	1.0
  (9, 22)	1.0
  (10, 21)	1.0
  (11, 4)	1.0

Original Input: ['Elon', 'Musk', "'s", 'Boring', 'Company', 'to', 'build', 'high-speed', 'Chicago', 'airport', 'link']
Encoded by Label Encoder: [ 7  9  0  2  6 25 17 20  4 11 22]
... (output truncated)
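
Summing the one-hot rows in `transform` gives one count vector per headline: each column holds how many times a vocabulary token occurs. The sketch below shows the same idea without the encoders. The helper name `count_vectors` and the toy tokens are assumptions made for illustration; they are not part of the original notebook.

```python
import numpy as np

def count_vectors(tokenized_docs):
    """Map each token list to a count vector over the shared vocabulary."""
    # A sorted vocabulary gives every distinct token a fixed column index.
    vocab = sorted({tok for doc in tokenized_docs for tok in doc})
    index = {tok: i for i, tok in enumerate(vocab)}

    vectors = np.zeros((len(tokenized_docs), len(vocab)), dtype=int)
    for row, doc in enumerate(tokenized_docs):
        for tok in doc:
            vectors[row, index[tok]] += 1  # count repeated tokens
    return vocab, vectors

# Toy input (hypothetical, not the headlines above)
docs = [["apple", "is", "fruit"], ["orange", "is", "fruit", "fruit"]]
vocab, vectors = count_vectors(docs)
print(vocab)
print(vectors)
```

Every vector has the same length (the vocabulary size), which is what allows the distance functions in the following sections to compare any two headlines.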

### Euclidean Distance

![](https://cdn-images-1.medium.com/max/800/0*Bd8VtxN8ql4qw4vo)

Photo Credit: http://dataaspirant.com/2015/04/11/five-most-popular-similarity-measures-implementation-in-python/

Euclidean distance is the straight-line (shortest) distance between two objects. It is based on the Pythagorean theorem, which you may remember from secondary school.

The score is the distance between two objects. If it is 0, both objects have identical vectors. The following example shows the score of each headline when compared with the first one.

In [49]:
print('Master Sentence: %s' % news_headlines[0])
for i, news_headline in enumerate(news_headlines):
    score = sklearn.metrics.pairwise.euclidean_distances([transformed_results[i]], [transformed_results[0]])[0][0]
    print('-----')
    print('Score: %.2f, Comparing Sentence: %s' % (score, news_headline))

Master Sentence: Elon Musk's Boring Co to build high-speed airport link in Chicago
-----
Score: 0.00, Comparing Sentence: Elon Musk's Boring Co to build high-speed airport link in Chicago
-----
Score: 1.73, Comparing Sentence: Elon Musk's Boring Company to build high-speed Chicago airport link
-----
Score: 4.36, Comparing Sentence: Elon Musk’s Boring Company approved to build high-speed transit between downtown Chicago and O’Hare Airport
-----
Score: 4.24, Comparing Sentence: Both apple and orange are fruit
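
To make the calculation concrete, here is a minimal sketch of the same distance computed by hand with NumPy. The function name `euclidean_distance` and the two small vectors are assumptions for illustration only.

```python
import numpy as np

def euclidean_distance(a, b):
    """Square root of the summed squared differences between two vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Hypothetical count vectors over a 4-word vocabulary
v1 = [1, 1, 1, 0]
v2 = [0, 2, 1, 1]
print(euclidean_distance(v1, v1))  # identical vectors give 0.0
print(euclidean_distance(v1, v2))  # sqrt(1 + 1 + 0 + 1) = sqrt(3)
```

This also explains the 1.73 above: the first two headlines differ in exactly three tokens ("Co" and "in" versus "Company"), and the square root of 3 is about 1.73. Because raw counts are compared, extra words push the distance up, which is why the unrelated fourth headline (4.24) ends up slightly closer than the longer third headline (4.36).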


### Cosine Similarity

![](https://i2.wp.com/dataaspirant.com/wp-content/uploads/2015/04/cosine.png)

Photo Credit: http://dataaspirant.com/2015/04/11/five-most-popular-similarity-measures-implementation-in-python/

Cosine similarity measures the angle between two vectors. For non-negative vectors such as word counts, the score ranges from 0 to 1. A score of 1 means that the vectors have the same orientation (regardless of magnitude). The following example shows the score of each headline when compared with the first one.

In [52]:
print('Master Sentence: %s' % news_headlines[0])
for i, news_headline in enumerate(news_headlines):
    score = sklearn.metrics.pairwise.cosine_similarity([transformed_results[i]], [transformed_results[0]])[0][0]
    print('-----')
    print('Score: %.2f, Comparing Sentence: %s' % (score, news_headline))

Master Sentence: Elon Musk's Boring Co to build high-speed airport link in Chicago
-----
Score: 1.00, Comparing Sentence: Elon Musk's Boring Co to build high-speed airport link in Chicago
-----
Score: 0.87, Comparing Sentence: Elon Musk's Boring Company to build high-speed Chicago airport link
-----
Score: 0.44, Comparing Sentence: Elon Musk’s Boring Company approved to build high-speed transit between downtown Chicago and O’Hare Airport
-----
Score: 0.00, Comparing Sentence: Both apple and orange are fruit
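
As a rough sketch of what the score means, cosine similarity is the dot product of two vectors divided by the product of their lengths. The function name `cosine_similarity` below, the toy vectors, and the choice to return 0.0 for an all-zero vector are assumptions for illustration, not scikit-learn's implementation.

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    norm_product = np.linalg.norm(a) * np.linalg.norm(b)
    if norm_product == 0.0:
        return 0.0  # assumed convention when either vector is all zeros
    return float(np.dot(a, b) / norm_product)

# Hypothetical count vectors over a 4-word vocabulary
v1 = [1, 1, 1, 0]
v2 = [2, 2, 2, 0]  # same direction as v1, twice the magnitude
v3 = [0, 0, 0, 1]  # shares no word with v1
print(cosine_similarity(v1, v2))  # approximately 1: same orientation
print(cosine_similarity(v1, v3))  # 0.0: nothing in common
```

For the first two headlines, 10 tokens are shared and the vectors hold 12 and 11 tokens, so the score is 10 / (sqrt(12) * sqrt(11)), which is about 0.87.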


### Jaccard Similarity
![](https://i0.wp.com/dataaspirant.com/wp-content/uploads/2015/04/jaccard_similariyt.png?resize=768%2C307)
Photo Credit: http://dataaspirant.com/2015/04/11/five-most-popular-similarity-measures-implementation-in-python/

This measurement is the number of common words divided by the number of all distinct words. The more words two objects share, the more similar they are.
Jaccard Similarity = (size of the intersection of A and B) / (size of the union of A and B)
The range is 0 to 1. A score of 1 means that both objects are identical. There is no common word between the first sentence and the last sentence, so the score is 0. The following example shows the score of each headline when compared with the first one.

In [55]:
"""
    Find the position (from the lookup table) of each word instead of using 1 or 0,
    so that the meaning of a "common" word is not misleading
"""

def calculate_position(values):
    x = []
    for pos, matrix in enumerate(values):
        if matrix > 0:
            x.append(pos)
    return x

"""
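    scikit-learn can only compare inputs of the same length,
    so pad the shorter sentence with -1.

    Note: jaccard_similarity_score (used below) was deprecated in scikit-learn 0.21
    and removed in 0.23, so this cell requires an older scikit-learn version.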
"""
def padding(sentence1, sentence2):
    x1 = sentence1.copy()
    x2 = sentence2.copy()
    
    diff = len(x1) - len(x2)
    
    if diff > 0:
        for i in range(0, diff):
            x2.append(-1)
    elif diff < 0:
        for i in range(0, abs(diff)):
            x1.append(-1)
    
    return x1, x2    

y_actual = calculate_position(transformed_results[0])

print('Master Sentence: %s' % news_headlines[0])
for i, news_headline in enumerate(news_headlines):
    y_compare = calculate_position(transformed_results[i])
    x1, x2 = padding(y_actual, y_compare)
    score = sklearn.metrics.jaccard_similarity_score(x1, x2)
    print('-----')
    print('Score: %.2f, Comparing Sentence: %s' % (score, news_headline))
    

Master Sentence: Elon Musk's Boring Co to build high-speed airport link in Chicago
-----
Score: 1.00, Comparing Sentence: Elon Musk's Boring Co to build high-speed airport link in Chicago
-----
Score: 0.67, Comparing Sentence: Elon Musk's Boring Company to build high-speed Chicago airport link
-----
Score: 0.17, Comparing Sentence: Elon Musk’s Boring Company approved to build high-speed transit between downtown Chicago and O’Hare Airport
-----
Score: 0.00, Comparing Sentence: Both apple and orange are fruit
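
Two caveats about the cell above: `jaccard_similarity_score` is no longer available from scikit-learn 0.23 onward, and for non-binary label lists it was documented as equivalent to accuracy, so it compares the two lists position by position rather than as sets. The sketch below follows the intersection-over-union formula directly on token sets. The function name `jaccard_similarity`, the toy tokens, and the handling of two empty inputs are assumptions for illustration, and its scores will generally differ from the ones printed above.

```python
def jaccard_similarity(tokens_a, tokens_b):
    """Size of the intersection divided by the size of the union."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 1.0  # assumption: two empty inputs are treated as identical
    return len(set_a & set_b) / len(union)

# Toy input (hypothetical, not the headlines above)
a = ["apple", "is", "fruit"]
b = ["orange", "is", "fruit"]
print(jaccard_similarity(a, b))  # 2 shared words / 4 distinct words = 0.5
```

Working on sets means no padding is needed, and repeated words or word order do not affect the score.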


# Conclusion
All three methods share the same assumption: documents (or sentences) are __similar if they have common words__. This idea is straightforward and simple. It fits basic cases such as comparing the first 2 sentences. However, the score is relatively low when comparing the first and third sentences, although both of them describe the same news.

Another limitation is that the above methods __do not handle synonyms__. For example, "buy" and "purchase" have the same meaning (in some cases), but the above methods treat them as different words.

So what is the solution? You may consider using word embeddings, such as word2vec, which was introduced by Tomas Mikolov et al. in 2013.