## TEXT SIMILARITY (or) MATCHING BETWEEN TWO STRINGS / DOCUMENTS

#### IMPORTING MODULES

In [1]:
from sklearn.metrics import jaccard_score  # jaccard_similarity_score was removed in scikit-learn 0.23

In [2]:
str1="this is the main point"
str2="a point is the circle with zero radius"

#### JACCARD SIMILARITY

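**According to Wikipedia**

The Jaccard coefficient measures similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets: $J(A, B) = |A \cap B| / |A \cup B|$. Here each string is treated as the set of its whitespace-separated words.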

In [3]:
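def jaccard(str1, str2):
    # Treat each string as a set of whitespace-separated words
    set1 = set(str1.split())
    set2 = set(str2.split())
    intersection = len(set1.intersection(set2))  # avoid shadowing the built-in int
    union = len(set1.union(set2))
    return intersection / union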

In [4]:
jaccard(str1,str2)

0.3

#### LEVENSHTEIN DISTANCE or EDIT DISTANCE

**According to Wikipedia**

In information theory and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.

In [5]:
import nltk
nltk.edit_distance(str1,str2)


25

In [6]:
nltk.edit_distance('hokey','cokey')

1
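To make the definition concrete, here is a minimal sketch of the classic dynamic-programming approach to edit distance. The function name `levenshtein` is an illustrative choice, not part of NLTK, and this is not claimed to be how `nltk.edit_distance` is implemented internally.

```python
def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions or substitutions."""
    # prev[j] is the distance between the previous prefix of s and the first j chars of t
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]  # turning the first i chars of s into "" takes i deletions
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]

print(levenshtein('hokey', 'cokey'))     # 1
print(levenshtein('kitten', 'sitting'))  # 3
```

Only two rows of the table are kept at a time, so memory grows with the length of the second string rather than with the product of both lengths.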

#### So far we have seen metrics that measure similarity between two strings directly. Now we move on to metrics that require the strings or documents to be represented as vectors.

**We can use either the Bag-of-Words approach or the word embeddings approach.**
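As a rough illustration of what a Bag-of-Words vector is, the sketch below builds raw count vectors by hand. The helper name `bow_vectors` is hypothetical. It uses plain whitespace splitting and raw counts, so it is not equivalent to scikit-learn's vectorizers: there is no TF-IDF weighting, and one-letter words such as "a" are kept.

```python
from collections import Counter

def bow_vectors(docs):
    """Return a sorted vocabulary and one raw count vector per document."""
    vocab = sorted({word for doc in docs for word in doc.lower().split()})
    vectors = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        # One entry per vocabulary word; words absent from the doc count as 0
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors

vocab, vectors = bow_vectors(["this is the main point",
                              "a point is the circle with zero radius"])
print(vocab)
print(vectors)
```

Each document becomes a vector with one position per vocabulary word, which is what the distance metrics below operate on.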

#### CREATING DOC VECTORS USING BOW APPROACH

In [7]:
corp=[str1,str2]
from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer
vect=TfidfVectorizer()
x_tfidf=vect.fit_transform(corp)
doc_vecs=x_tfidf.toarray()

In [8]:
doc_vecs.shape  # 2 docs and 9 features (unigrams); the default tokenizer drops the one-letter word "a"

(2, 9)

In [9]:
doc_vecs

array([[0.        , 0.37930349, 0.53309782, 0.37930349, 0.        ,
        0.37930349, 0.53309782, 0.        , 0.        ],
       [0.42567716, 0.30287281, 0.        , 0.30287281, 0.42567716,
        0.30287281, 0.        , 0.42567716, 0.42567716]])

#### MANHATTAN DISTANCE

Manhattan distance is a metric in which the distance between two points is the sum of the absolute differences of their Cartesian coordinates. Put simply, in two dimensions it is the absolute difference between the x-coordinates plus the absolute difference between the y-coordinates.

In [10]:
def manhattan_distance(doc_vec1,doc_vec2):
    return sum(abs(a-b) for a,b in zip(doc_vec1,doc_vec2))

In [11]:
dist=manhattan_distance(doc_vecs[0],doc_vecs[1])
print(dist)

2.998196356954673


#### EUCLIDEAN DISTANCE

The Euclidean distance between two points is the length of the straight line segment connecting them. The Pythagorean theorem gives this distance.

In [12]:
import math
def euclidean_distance(doc_vec1,doc_vec2):
    return math.sqrt(sum(pow(a-b,2) for a,b in zip(doc_vec1,doc_vec2)))
 

In [13]:
dist=euclidean_distance(doc_vecs[0],doc_vecs[1])
print(dist)

1.1448649343585864


#### MINKOWSKI DISTANCE 

The Minkowski distance is a metric in a normed vector space which can be considered as a generalization of both the Euclidean distance and the Manhattan distance.

Read [this](https://en.wikipedia.org/wiki/Minkowski_distance).

In [14]:
from sklearn.metrics import pairwise_distances # check out doc. it's nice.

In [15]:
pairwise_distances(doc_vecs[0].reshape(1,-1),doc_vecs[1].reshape(1,-1),metric='minkowski') # reshape each vector to 2-D (1, n_features); with the default p=2 this equals the Euclidean distance

array([[1.14486493]])
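The result above matches the Euclidean distance computed earlier because no `p` was passed and the default order is 2. The following sketch, using a hypothetical helper `minkowski_distance`, shows how the order `p` recovers both earlier metrics; it assumes `p >= 1`.

```python
def minkowski_distance(u, v, p):
    """Minkowski distance of order p between two equal-length vectors (assumes p >= 1)."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1 / p)

u, v = [0.0, 0.0], [3.0, 4.0]
print(minkowski_distance(u, v, 1))  # 7.0, same as the Manhattan distance
print(minkowski_distance(u, v, 2))  # 5.0, same as the Euclidean distance
```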

#### COSINE SIMILARITY

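Cosine similarity is the normalized dot product of two vectors, i.e. the cosine of the angle between them. Note that `pairwise_distances` with `metric='cosine'` returns the cosine distance, which is 1 minus the cosine similarity, so the similarity between the two documents here is about 0.345.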

In [16]:
from sklearn.metrics import pairwise_distances # check out doc. it's nice.
pairwise_distances(doc_vecs[0].reshape(1,-1),doc_vecs[1].reshape(1,-1),metric='cosine') # reshape each vector to 2-D (1, n_features)

array([[0.65535786]])
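A minimal pure-Python sketch of the relationship between cosine similarity and cosine distance follows. The function names are illustrative and are not scikit-learn APIs, and the sketch assumes neither vector is all zeros.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)  # assumes both norms are non-zero

def cosine_distance(u, v):
    """Cosine distance is defined here as 1 minus cosine similarity."""
    return 1.0 - cosine_similarity(u, v)

print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors: 0.0
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))    # so the distance is 1.0
```

A higher similarity means the documents are closer, while a higher distance means they are further apart, so it matters which of the two a function returns.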

#### Similarly, we can use word embeddings: represent each document by averaging its word embeddings, by weighting them with TF-IDF scores, or with doc2vec, and then compute the similarity between documents.
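The averaging idea can be sketched as follows. The `toy_embeddings` dictionary is entirely made up for illustration and does not come from any trained model. In practice the vectors would come from a pretrained embedding model. The helper names are hypothetical as well.

```python
import math

# Hypothetical 3-dimensional word vectors, invented for this example only
toy_embeddings = {
    "point":  [0.9, 0.1, 0.0],
    "circle": [0.8, 0.2, 0.1],
    "radius": [0.7, 0.3, 0.1],
    "main":   [0.1, 0.9, 0.2],
}

def average_embedding(text, embeddings):
    """Represent a document as the mean of the vectors of its known words."""
    vecs = [embeddings[w] for w in text.split() if w in embeddings]
    if not vecs:
        raise ValueError("no word in the text has an embedding")
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(len(vecs[0]))]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

doc1 = average_embedding("this is the main point", toy_embeddings)
doc2 = average_embedding("a point is the circle with zero radius", toy_embeddings)
print(cosine_similarity(doc1, doc2))
```

Words without an embedding are simply skipped here; a TF-IDF weighted variant would multiply each word vector by its TF-IDF score before averaging.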