This notebook tests different approaches to comparing two samples of text and measuring how similar they are.

In [1]:
sample1 = "The easiest way to earn points with Fetch Rewards is to just shop for the products you already love. If you have any participating brands on your receipt, you'll get points based on the cost of the products. You don't need to clip any coupons or scan individual barcodes. Just scan each grocery receipt after you shop and we'll find the savings for you."
sample2 = "The easiest way to earn points with Fetch Rewards is to just shop for the items you already buy. If you have any eligible brands on your receipt, you will get points based on the total cost of the products. You do not need to cut out any coupons or scan individual UPCs. Just scan your receipt after you check out and we will find the savings for you."
sample3 = "We are always looking for opportunities for you to earn more points, which is why we also give you a selection of Special Offers. These Special Offers are opportunities to earn bonus points on top of the regular points you earn every time you purchase a participating brand. No need to pre-select these offers, we'll give you the points whether or not you knew about the offer. We just think it is easier that way."

samples = {
    'Sample1' : sample1,
    'Sample2' : sample2,
    'Sample3' : sample3
}

In [2]:
for sample_num, text in samples.items():
    print(f'{sample_num} has {len(text.split())} words')

Sample1 has 64 words
Sample2 has 69 words
Sample3 has 75 words


___
### First approach - Word Count and Cosine Similarity
For this first approach, I count the words in each text sample and then measure how close the counts are using cosine similarity.
* Clean the text
* Set up arrays that count the occurrences of words in each sample
* Calculate the cosine similarity between the resulting arrays

In [3]:
import numpy as np
from itertools import combinations

In [4]:
def simple_clean(text:str):
    '''Remove commas and periods and make all letters lowercase.'''
    return text.lower().replace(',','').replace('.','')
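
As a quick check of what this cleaning step does and does not touch, the sketch below repeats the function on a made-up sentence (the sentence is an illustrative assumption, not one of the samples). Apostrophes and hyphens are left in place, so `we'll` and `we will` would still count as different words.

```python
def simple_clean(text: str) -> str:
    '''Same cleaning as above: lowercase, then drop commas and periods.'''
    return text.lower().replace(',', '').replace('.', '')

# Made-up sentence for illustration only
example = "No need to pre-select, we'll do it."
cleaned = simple_clean(example)
print(cleaned)          # no need to pre-select we'll do it
print(cleaned.split())  # ['no', 'need', 'to', 'pre-select', "we'll", 'do', 'it']
```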

In [5]:
def cos_sim(vector1, vector2):
    '''
    Recreate the cosine similarity function.
    Takes two vectors and returns the cosine of the angle between them.
    For non-negative vectors such as word counts, the result is a float between 0 and 1.
    Closer to 1 means they are more similar.
    Closer to 0 means they are less similar.
    '''
    return np.dot(vector1, vector2) / (np.linalg.norm(vector1) * np.linalg.norm(vector2))
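
To make the formula concrete, here is a small worked example that can be checked by hand. It is a sketch only: `cos_sim_plain` is a hypothetical pure-Python stand-in for `cos_sim` (so the block runs without NumPy), and the vectors are made-up values rather than counts from the samples.

```python
import math

def cos_sim_plain(vector1, vector2):
    '''Hypothetical pure-Python equivalent of cos_sim, for checking by hand.'''
    dot = sum(a * b for a, b in zip(vector1, vector2))
    norm1 = math.sqrt(sum(a * a for a in vector1))
    norm2 = math.sqrt(sum(b * b for b in vector2))
    return dot / (norm1 * norm2)

# Toy count vectors over a three-word vocabulary (illustrative values)
v1 = [1, 1, 0]
v2 = [1, 0, 1]

# dot = 1 and each norm = sqrt(2), so the score is 1 / 2
print(round(cos_sim_plain(v1, v2), 2))          # 0.5
# A vector compared with itself scores 1
print(round(cos_sim_plain(v1, v1), 2))          # 1.0
# Vectors with no words in common score 0
print(round(cos_sim_plain([1, 0], [0, 1]), 2))  # 0.0
```

One caveat: an all-zero vector (an empty text) has a norm of 0, so this sketch would raise `ZeroDivisionError`. None of the samples in this notebook hit that case.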

In [6]:
def vectorize_text(text1:str, text2:str):
    '''
    Takes two strings and returns the combined word list plus two vectors
    of word counts (one per string), both aligned to that word list.
    '''
    word_list = []
    t1_word_count = []
    
    #Vectorize text1
    for word in simple_clean(text1).split():
        if word in word_list:
            t1_word_count[word_list.index(word)] += 1
        else:
            word_list.append(word)
            t1_word_count.append(1)
    
    #Create list of zeros for text2
    t2_word_count = [0 for _ in t1_word_count]
    
    #Vectorize text2
    for word in simple_clean(text2).split():
        if word in word_list:
            t2_word_count[word_list.index(word)] += 1
        else:
            word_list.append(word)
            t1_word_count.append(0)
            t2_word_count.append(1)
            
    return word_list, t1_word_count, t2_word_count
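
For comparison, the same shared-vocabulary idea can be sketched with `collections.Counter` from the standard library. This is an alternative under stated assumptions, not the function used in the rest of this notebook: `vectorize_with_counter` is a hypothetical name, the inputs are assumed to be cleaned already, and the two short strings are made up for illustration.

```python
from collections import Counter

def vectorize_with_counter(text1: str, text2: str):
    '''Hypothetical Counter-based variant; assumes the texts are already cleaned.'''
    counts1 = Counter(text1.split())
    counts2 = Counter(text2.split())
    # Shared vocabulary: words of text1 in first-seen order, then new words from text2
    vocab = list(dict.fromkeys(text1.split() + text2.split()))
    # A Counter returns 0 for a missing word, so no zero-padding step is needed
    return vocab, [counts1[w] for w in vocab], [counts2[w] for w in vocab]

vocab, v1, v2 = vectorize_with_counter("scan the receipt", "scan the the barcode")
print(vocab)  # ['scan', 'the', 'receipt', 'barcode']
print(v1)     # [1, 1, 1, 0]
print(v2)     # [1, 2, 0, 1]
```

The output has the same layout as `vectorize_text`: one word list and two count vectors of equal length. Looking words up in a `Counter` avoids the repeated `word_list.index(word)` scans, which may matter for longer texts.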

In [7]:
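def measure_similarity(samples):
    '''Print the cosine similarity for every pair of samples in the dict.'''
    for (name1, text1), (name2, text2) in combinations(samples.items(), 2):
        # The word list itself is not needed here, only the two count vectors
        _, vector1, vector2 = vectorize_text(text1, text2)
        similarity_score = cos_sim(vector1, vector2)
        print(f'{name1} and {name2} have a similarity of {similarity_score:.2f}')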

In [8]:
measure_similarity(samples)

Sample1 and Sample2 have a similarity of 0.89
Sample1 and Sample3 have a similarity of 0.56
Sample2 and Sample3 have a similarity of 0.58
