## Cosine Similarity Notebook 

#### @Author: Jyontika Kapoor


#### **Steps:**
Step 1: Create the vocabulary of all unique terms (each term becomes one dimension of the vector space)

Step 2: Represent each document and the query in the vector space created by these terms

Step 3: Calculate the cosine similarity between the query and each document

Step 4: Rank the results based on the cosine similarity
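
For reference, the quantity computed in Step 3 is the standard cosine of the angle between two vectors. The symbols $q$ (query vector), $d$ (document vector), and $n$ (vocabulary size) are chosen here for illustration and do not appear in the code below:

$$
\cos\theta = \frac{q \cdot d}{\lVert q \rVert \, \lVert d \rVert}
= \frac{\sum_{i=1}^{n} q_i d_i}{\sqrt{\sum_{i=1}^{n} q_i^2}\;\sqrt{\sum_{i=1}^{n} d_i^2}}
$$

Because the vectors hold non-negative term counts, the value lies between 0 (no shared terms) and 1 (same direction).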


In [1]:
import numpy as np
import pandas as pd
import string



In [7]:

def create_vocabulary(document_text):
    """
    Step 1: Create the vocabulary of all unique terms (each of them will be a dimension)
    
    @param document_text: input text from which the vocabulary is built
    @return: alphabetically sorted list of the unique lowercase terms
    """
    document_text = document_text.lower()
    text = "".join(char for char in document_text if char not in string.punctuation)
    words = set(text.split())
    vocabulary = sorted(words)

    return vocabulary


In [8]:
def test_create_vocabulary():
    # Test Case 1: Basic functionality
    text_1 = "This is a simple test."
    result_1 = create_vocabulary(text_1)
    assert result_1 == ['a', 'is', 'simple', 'test', 'this']

    # Test Case 2: Empty input
    text_2 = ""
    result_2 = create_vocabulary(text_2)
    assert result_2 == []

    # Test Case 3: Input with punctuation
    text_3 = "Python is awesome!"
    result_3 = create_vocabulary(text_3)
    assert result_3 == ['awesome', 'is', 'python']

    # Test Case 4: Input with repeated words
    text_4 = "Testing testing one two two"
    result_4 = create_vocabulary(text_4)
    assert result_4 == ['one', 'testing', 'two']

    print("All test cases passed!")

# Run test cases
test_create_vocabulary()


All test cases passed!
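
One edge case the tests above do not cover is punctuation inside a word. The sketch below reproduces the same lowercase-and-delete-punctuation rule in a hypothetical helper, `strip_punctuation`, and contrasts it with a variant, `space_punctuation`, that replaces punctuation with spaces instead. Both helper names are assumptions for illustration, not part of the notebook.

```python
import string


def strip_punctuation(text):
    """Lowercase the text and delete ASCII punctuation (the rule used in Step 1)."""
    return "".join(ch for ch in text.lower() if ch not in string.punctuation)


def space_punctuation(text):
    """Hypothetical variant: lowercase and replace each punctuation mark with a space."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.lower().translate(table)


sample = "Don't split well-known words."

deleted = sorted(set(strip_punctuation(sample).split()))
spaced = sorted(set(space_punctuation(sample).split()))

print(deleted)  # ['dont', 'split', 'wellknown', 'words']
print(spaced)   # ['don', 'known', 'split', 't', 'well', 'words']
```

Deleting punctuation fuses "well-known" into a single term, while replacing it with spaces splits contractions into fragments. Neither is wrong in general; which is preferable depends on the documents being indexed.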


In [9]:
##Step 2: Represent each document and the query in the vector space created by these terms

def sentence_to_vector(sentence, vocabulary):
    """Represent a sentence as a vector of term counts, one entry per vocabulary term."""
    # Normalize like create_vocabulary so that "sentence." matches the term "sentence"
    cleaned = "".join(char for char in sentence.lower() if char not in string.punctuation)
    tokens = cleaned.split()
    return [tokens.count(word) for word in vocabulary]


In [11]:
sentences = [
    "This is the first sentence.",
    "The second sentence is here.",
    "And this is the third one.",
]


all_text = " ".join(sentences)
vocab = create_vocabulary(all_text)

# Convert each sentence to a vector
sentence_vectors = [sentence_to_vector(sentence, vocab) for sentence in sentences]


for i, vector in enumerate(sentence_vectors, 1):
    print(f"Sentence {i} Vector: {vector}")

Sentence 1 Vector: [0, 1, 0, 1, 0, 0, 1, 1, 0, 1]
Sentence 2 Vector: [0, 0, 1, 1, 0, 1, 1, 1, 0, 0]
Sentence 3 Vector: [1, 0, 0, 1, 1, 0, 0, 1, 1, 1]
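
As a sketch of one alternative way to build the count vectors, the sentence can be tokenized once and tallied with `collections.Counter`, which avoids rescanning the token list for every vocabulary term. The helper names `tokenize` and `count_vector` are hypothetical, and the tokenizer is assumed to apply the same lowercase-and-strip-punctuation rule as `create_vocabulary` so that tokens line up with vocabulary entries.

```python
import string
from collections import Counter


def tokenize(text):
    """Lowercase, delete ASCII punctuation, and split on whitespace."""
    cleaned = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    return cleaned.split()


def count_vector(sentence, vocabulary):
    """Return one term count per vocabulary entry, in vocabulary order."""
    counts = Counter(tokenize(sentence))
    # A Counter returns 0 for terms that never occurred, so no membership check is needed.
    return [counts[term] for term in vocabulary]


vocabulary = ["first", "is", "sentence", "the", "this"]

vector_a = count_vector("This is the first sentence.", vocabulary)
vector_b = count_vector("The the THE!", vocabulary)

print(vector_a)  # [1, 1, 1, 1, 1]
print(vector_b)  # [0, 0, 0, 3, 0]
```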


In [14]:
# Step 3: Calculate the cosine similarity between the query and each document

def calculate_cosine_similarity(query_vector, document_vector):
    """
    Calculate cosine similarity between a query vector and a document vector.
    """
    #  dot product
    dot_product = sum(i * j for i, j in zip(query_vector, document_vector))

    #  magnitudes
    query_magnitude = sum(i**2 for i in query_vector) ** 0.5
    document_magnitude = sum(i**2 for i in document_vector) ** 0.5

    #  cosine similarity
    if query_magnitude == 0 or document_magnitude == 0:
        return 0.0  # cosine is undefined when either vector is all zeros; treat as no similarity
    else:
        similarity = dot_product / (query_magnitude * document_magnitude)
        return similarity



In [13]:
#test
query_vector = [1, 0, 1]
document_vector = [0, 1, 1]

cosine_similarity_value = calculate_cosine_similarity(query_vector, document_vector)
print(f"Cosine Similarity: {cosine_similarity_value}")


Cosine Similarity: 0.4999999999999999
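
The printed value is 0.4999999999999999 rather than 0.5 because of floating-point rounding: each magnitude is an approximation of the square root of 2, and their product comes out slightly above 2. A minimal sketch of how such a result might be compared against an expected value, using `math.isclose` from the standard library instead of `==`:

```python
import math

query_vector = [1, 0, 1]
document_vector = [0, 1, 1]

# Same arithmetic as the notebook's cosine similarity function.
dot = sum(q * d for q, d in zip(query_vector, document_vector))
query_norm = sum(q ** 2 for q in query_vector) ** 0.5
document_norm = sum(d ** 2 for d in document_vector) ** 0.5
similarity = dot / (query_norm * document_norm)

# Exact equality is fragile for floats; a tolerance-based check is safer.
print(similarity == 0.5)
print(math.isclose(similarity, 0.5))
```

The same consideration applies when writing assertions for similarity scores: comparing with a tolerance keeps a test from failing over a difference in the last decimal place.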


In [20]:
#Step 4: rank documents

def rank_results(query_vector, document_vectors):
    """
    Rank the results based on cosine similarity between the query and each document.
    """
    results = []

    for i, document_vector in enumerate(document_vectors):
        similarity = calculate_cosine_similarity(query_vector, document_vector)
        results.append((i, similarity))

    # Rank results based on cosine similarity
    ranked_results = sorted(results, key=lambda x: x[1], reverse=True)
    return ranked_results


In [19]:
# tests
query_vector = [1, 0, 1]
document_vectors = [[0, 1, 1], [1, 1, 0], [0, 0, 1]]

ranked_results = rank_results(query_vector, document_vectors)

for rank, (doc_index, similarity) in enumerate(ranked_results, 1):
    print(f"Rank {rank}: Document {doc_index + 1} - Cosine Similarity: {similarity}")


Rank 1: Document 3 - Cosine Similarity: 0.7071067811865475
Rank 2: Document 1 - Cosine Similarity: 0.4999999999999999
Rank 3: Document 2 - Cosine Similarity: 0.4999999999999999
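
Documents 1 and 2 tie in the output above. Python's `sorted` is stable, so tied documents keep their original order. The four steps can also be chained on raw text rather than hand-written vectors. The block below is a self-contained sketch under stated assumptions: the helper names (`tokenize`, `to_vector`, `cosine`) and the query string are hypothetical, and the tokenizer is assumed to strip punctuation the same way for documents and query.

```python
import string


def tokenize(text):
    """Lowercase, delete ASCII punctuation, and split on whitespace."""
    cleaned = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    return cleaned.split()


def to_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in the text."""
    tokens = tokenize(text)
    return [tokens.count(term) for term in vocabulary]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


documents = [
    "This is the first sentence.",
    "The second sentence is here.",
    "And this is the third one.",
]
query = "the third one"  # hypothetical query for illustration

# Step 1: vocabulary built from the documents
vocabulary = sorted({term for doc in documents for term in tokenize(doc)})

# Step 2: documents and query in the same vector space
doc_vectors = [to_vector(doc, vocabulary) for doc in documents]
query_vector = to_vector(query, vocabulary)

# Steps 3 and 4: score every document, then sort by score, highest first
ranking = sorted(
    ((i, cosine(query_vector, vec)) for i, vec in enumerate(doc_vectors)),
    key=lambda pair: pair[1],
    reverse=True,
)

for rank, (doc_index, score) in enumerate(ranking, 1):
    print(f"Rank {rank}: Document {doc_index + 1} - Cosine Similarity: {score:.4f}")
```

With this query, Document 3 ranks first because it contains all three query terms. Documents 1 and 2 share only "the" with the query and tie, so they appear in their original order.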
