### Vectorization
After initial text preprocessing, we need to transform the text into a meaningful vector of numbers that a model can operate on. There are several techniques for this. A few popular ones are:

1. Bag of Words
2. TF-IDF (Term Frequency - Inverse Document Frequency)
3. N-Grams Model
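
To make the first and third techniques concrete, here is a minimal sketch of a bag of words and of word-level n-grams using only the standard library. The helper names `bag_of_words` and `ngrams` are hypothetical and chosen just for this illustration; whitespace splitting and lowercasing are simplifying assumptions.

```python
from collections import Counter

def bag_of_words(text):
    """Map each lowercase token to the number of times it occurs."""
    return Counter(text.lower().split())

def ngrams(text, n=2):
    """Return the n-token tuples of the text in reading order."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "get busy living or get busy dying"
bow = bag_of_words(sentence)     # word order is discarded, only counts remain
bigrams = ngrams(sentence, 2)    # neighbouring word pairs keep some of the order

print(bow)
print(bigrams[:3])
```

A bag of words ignores word order entirely, while n-grams keep a little local context at the cost of a larger vocabulary.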

#### TF-IDF
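TF-IDF stands for term frequency - inverse document frequency.

tf (term frequency): how often a word appears in a single document, relative to the length of that document

tf = count of the word in the document / number of words in the document


idf (inverse document frequency): how rare a word is across all documents; the more documents contain the word, the lower its idf

idf = log(total documents / number of documents containing the word)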
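
As a quick worked example of the two formulas, the sketch below computes tf and idf on a toy corpus of three tiny documents. The corpus and the helper names `tf` and `idf` are made up for illustration, and base-10 logarithm is assumed to match the code later in this notebook.

```python
import math

# toy corpus: each document is already split into words
docs = [
    "the cat sat".split(),
    "the dog sat".split(),
    "the cat ran away".split(),
]

def tf(word, doc):
    """Occurrences of the word in the document divided by document length."""
    return doc.count(word) / len(doc)   # list.count compares whole tokens

def idf(word, docs):
    """log10(total documents / documents containing the word)."""
    # assumes the word occurs in at least one document, otherwise this divides by zero
    containing = sum(1 for d in docs if word in d)
    return math.log10(len(docs) / containing)

tf_cat = tf("cat", docs[0])    # 1 occurrence out of 3 words
idf_cat = idf("cat", docs)     # "cat" is in 2 of 3 documents
idf_the = idf("the", docs)     # "the" is in every document, so its idf is 0

print(tf_cat, idf_cat, tf_cat * idf_cat)
print(idf_the)
```

A word that appears in every document gets an idf of zero, so it contributes nothing to the vector no matter how often it occurs.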


In [22]:
import nltk
import string
import pandas as pd
import os
import math
import numpy as np

In [4]:
quotes = ['Be yourself; everyone else is already taken.',
 'The biggest adventure you can take is to live the life of your dreams.',
 'The only thing we have to fear is fear itself.',
 'Some people want it to happen, some wish it would happen, others make it happen.',
 'You’ve got to be in it to win it.',
 'It does not matter how slowly you go, as long as you do not stop.',
 'Find out who you are and do it on purpose.',
 'For me, becoming isn’t about arriving somewhere or achieving a certain aim. I see it instead as forward motion, a means of evolving, a way to reach continuously toward a better self. The journey doesn’t end.',
 'Confident people have a way of carrying themselves that makes others attracted to them.',
 'If you can do what you do best and be happy, you are further along in life than most people.',
 'You can be everything. You can be the infinite amount of things that people are.',
 'Always go with your passions. Never ask yourself if it’s realistic or not.',
 'When you change your thoughts, remember to also change your world.',
 'The more you know who you are, and what you want, the less you let things upset you.',
 'By being yourself, you put something wonderful in the world that was not there before.',
 'Do one thing every day that scares you.',
 'It is never too late to be what you might have been.',
 'Find out who you are and be that person. That’s what your soul was put on this earth to be. Find the truth, live that truth, and everything else will come.',
 'When we are no longer able to change a situation, we are challenged to change ourselves.',
 'If you cannot do great things, do small things in a great way.',
 'Always do your best. What you plant now, you will harvest later.',
 'Get busy living or get busy dying.',
 'In three words I can sum up everything I’ve learned about life: It goes on.',
 'You can’t help what you feel, but you can help how you behave.',
 'No need to hurry. No need to sparkle. No need to be anybody but oneself.',
 'Promise me you’ll always remember: You’re braver than you believe, and stronger than you seem, and smarter than you think.',
 'Failure is a great teacher and, if you are open to it, every mistake has a lesson to offer.',
 'If you don’t like the road you’re walking, start paving another one.',
 'Don’t let yesterday take up too much of today.',
 'Keep smiling, because life is a beautiful thing and there’s so much to smile about.',
 'Be persistent and never give up hope.',
 'When we strive to become better than we are, everything around us becomes better too.',
 'Believe and act as if it were impossible to fail.',
 'There are so many great things in life; why dwell on negativity?',
 'Happiness often sneaks in through a door you didn’t know you left open.',
 'Always remember that you are absolutely unique. Just like everyone else.',
 'Keep your face towards the sunshine and shadows will fall behind you.',
 'A problem is a chance for you to do your best.',
 'You don’t always need a plan. Sometimes you just need to breathe, trust, let go and see what happens.',
 'Nothing is impossible. The word itself says ‘I’m possible!’',
 'Life does not have to be perfect to be wonderful.',
 'It is during our darkest moments that we must focus to see the light.',
 'The best way out is through.',
 'Don’t be afraid to give up the good to go for the great.',
 'Whether you think you can or you can’t, you’re right.',
 'Don’t take yourself too seriously. Know when to laugh at yourself, and find a way to laugh at obstacles that inevitably present themselves.',
 'Love the life you live. Live the life you love.',
 'Keep your face towards the sunshine and shadows will fall behind you.',
 'The only person you are destined to become is the person you decide to be.',
 'I’m not going to continue knocking that old door that doesn’t open for me. I’m going to create my own door and walk through that.',
 'If you change the way you look at things, the things you look at change.',
 'I believe that if you’ll just stand up and go, life will open up for you. Something just motivates you to keep moving.',
 'Once you face your fear, nothing is ever as hard as you think.']

Let's clean up the text a little. Ideally we would perform complete preprocessing for better results, but for learning purposes removing punctuation is enough.

In [12]:
cleanedQuotes = []
for q in quotes:
    for p in string.punctuation:
        q = q.replace(p, "")

    cleanedQuotes.append(q)

cleanedQuotes[:5]

['Be yourself everyone else is already taken',
 'The biggest adventure you can take is to live the life of your dreams',
 'The only thing we have to fear is fear itself',
 'Some people want it to happen some wish it would happen others make it happen',
 'You’ve got to be in it to win it']
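
Note that the output above still contains the curly apostrophe in "You’ve" and keeps the original capitalization, because `string.punctuation` only covers ASCII punctuation. A slightly fuller cleanup could look like the sketch below; the `normalize` helper is hypothetical, and lowercasing plus dropping curly quotes are assumptions about what "complete preprocessing" might include.

```python
import string

def normalize(text):
    """Lowercase the text and drop ASCII punctuation plus curly quotes."""
    curly_quotes = "\u2018\u2019\u201c\u201d"  # not part of string.punctuation
    table = str.maketrans("", "", string.punctuation + curly_quotes)
    return text.lower().translate(table)

cleaned = normalize("You\u2019ve got to be in it to win it.")
print(cleaned)
```

With this kind of normalization "Be" and "be" become the same vocabulary entry, and contractions such as "You’ve" collapse into a single token ("youve"). The rest of the notebook keeps the simpler cleanup, so its outputs are based on the case-sensitive text.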

Let's compute some metadata that will help with the tf-idf computation.

In [15]:
# unique words count
uniqueWords = list(set(" ".join(cleanedQuotes).split()))
print(len(uniqueWords))

totalDocumentCount = len(cleanedQuotes)
print(totalDocumentCount)

321
53


Let's calculate tf-idf for each quote and form the vectors.

In [40]:
vectorSpaces = []


for quote in cleanedQuotes:
    crr_quote_list = quote.split()
    vectorSpace = []
    for uw in uniqueWords: 
        if uw in crr_quote_list:
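            # tf - term frequency: occurrences of the word in this document / document length
            # note: str.count matches substrings, so "be" is also counted inside "becoming"
            tf = quote.count(uw) / len(crr_quote_list)

            # idf - inverse document frequency: log10(total documents / documents containing the word)
            # note: q.count(uw) is a substring match as well
            wordAppearedCount = len([q for q in cleanedQuotes if q.count(uw) > 0])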
            idf = math.log10(totalDocumentCount/wordAppearedCount)

            vectorSpace.append(tf*idf)
        else:
            vectorSpace.append(0)
    vectorSpaces.append({quote:vectorSpace})
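
One caveat with the loop above: `str.count` counts substring matches, so a short word such as "be" is also counted inside "becoming". A token-based variant is sketched below on a toy corpus. The function names `build_idf` and `tfidf_vector` are hypothetical, and the scores it produces will differ from the outputs shown in this notebook, which come from the substring-based version.

```python
import math
from collections import Counter

def build_idf(tokenized_docs):
    """Return {word: log10(N / number of documents containing the word)}."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each word once per document
    return {w: math.log10(n_docs / df) for w, df in doc_freq.items()}

def tfidf_vector(tokens, vocabulary, idf):
    """Return one tf-idf weight per vocabulary word, in vocabulary order."""
    counts = Counter(tokens)          # whole-token counts, no substring matches
    total = len(tokens)
    return [
        (counts[w] / total) * idf[w] if counts[w] else 0.0
        for w in vocabulary
    ]

docs = ["be yourself", "becoming is a journey", "be happy"]
tokenized = [d.split() for d in docs]
vocabulary = sorted({w for tokens in tokenized for w in tokens})
idf = build_idf(tokenized)
vectors = [tfidf_vector(tokens, vocabulary, idf) for tokens in tokenized]

# "be" gets no weight in the second document, even though "becoming" contains it
print(vectors[1][vocabulary.index("be")])
```

Precomputing the idf table once also avoids rescanning every quote for every word, which the nested loop above does.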



In [41]:
# let's look into some data
vectorSpaces[:2]

[{'Be yourself everyone else is already taken': [0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0.16031655403897524,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0.17816494498301808,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0.24632512422868413,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,


Let's define a function to compute the cosine similarity between two vectors.

In [23]:
def calculate_cosine_similarity(vector_a, vector_b):
    """
    Calculates the cosine similarity between two NumPy arrays (vectors).
    """
    dot_product = np.dot(vector_a, vector_b)
    magnitude_a = np.linalg.norm(vector_a)
    magnitude_b = np.linalg.norm(vector_b)

    if magnitude_a == 0 or magnitude_b == 0:
        return 0  # Handle cases where one or both vectors are zero vectors

    cosine_similarity = dot_product / (magnitude_a * magnitude_b)
    return cosine_similarity
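
To build some intuition for the values this returns, here is a pure-Python equivalent checked on three small cases. The `cosine` helper and the example vectors are made up for illustration; it is meant to mirror the NumPy function above, not replace it.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # same convention as above for zero vectors
    return dot / (norm_a * norm_b)

same_direction = cosine([1, 2, 0], [2, 4, 0])   # parallel vectors, close to 1
orthogonal = cosine([1, 0, 0], [0, 3, 0])       # no shared non-zero dimension
zero_vector = cosine([0, 0, 0], [1, 2, 3])      # guarded zero-vector case

print(same_direction, orthogonal, zero_vector)
```

Because tf-idf weights are never negative, the similarity between two tf-idf vectors stays between 0 and 1: quotes that share no vocabulary words with the query score 0.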
 

Now, let's take a partial quote and see which actual quote matches it best by cosine similarity.

In [25]:
partialQuote = "Always remember"

To find similar quotes, we first have to compute the tf-idf vector for the partial quote, using the same vocabulary.

In [42]:

crr_quote_list = partialQuote.split()
vectorSpace = []
for uw in uniqueWords: 
    if uw in crr_quote_list:
    
        # tf - term frequency: occurrences of the word in the query / query length
        tf = partialQuote.count(uw) / len(crr_quote_list)

        # idf - inverse document frequency: log10(total documents / documents containing the word)
        wordAppearedCount = len([q for q in cleanedQuotes if q.count(uw) > 0])
        idf = math.log10(totalDocumentCount/wordAppearedCount)

        vectorSpace.append(tf*idf)

    else:
        vectorSpace.append(0)

queryVector = vectorSpace
print(queryVector)



[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.6235773074405633, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.6235773074405633, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


Now, let's compute the cosine similarity between the query vector and each quote vector.

In [53]:
cosineSimilarities = []

for vs in vectorSpaces: 
    cosineSimilarities.append((list(vs.keys())[0], calculate_cosine_similarity(queryVector, list(vs.values())[0])))

In [54]:
cosineSimilarities[:2]

[('Be yourself everyone else is already taken', np.float64(0.0)),
 ('The biggest adventure you can take is to live the life of your dreams',
  np.float64(0.0))]

In [47]:
import heapq

In [55]:
# optional: heapq.nlargest accepts any iterable, so heapifying first is not required
heapq.heapify(cosineSimilarities)

In [60]:
# get the 3 quotes that best match the partial quote
matchedQuotes = heapq.nlargest(3, cosineSimilarities, key=lambda x: x[1] )
matchedQuotes

[('Always remember that you are absolutely unique Just like everyone else',
  np.float64(0.40860901827254376)),
 ('Always do your best What you plant now you will harvest later',
  np.float64(0.20639354586651698)),
 ('When you change your thoughts remember to also change your world',
  np.float64(0.19953133095807818))]
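
As a small aside on the ranking step, `heapq.nlargest` works directly on an unsorted list of `(quote, score)` pairs and gives the same result as sorting by score in descending order and slicing. The sketch below uses made-up placeholder quotes and scores, not values from this notebook.

```python
import heapq

# illustrative (quote, score) pairs in arbitrary order
scores = [
    ("quote a", 0.0),
    ("quote b", 0.41),
    ("quote c", 0.20),
    ("quote d", 0.0),
    ("quote e", 0.21),
]

# top 3 by score, without heapifying or sorting the list first
top3 = heapq.nlargest(3, scores, key=lambda pair: pair[1])

# equivalent formulation with a full sort
top3_sorted = sorted(scores, key=lambda pair: pair[1], reverse=True)[:3]

print(top3)
```

For a handful of quotes either form is fine; `nlargest` mainly pays off when the list is long and only a few top results are needed.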

In [63]:
# Result

print(f"For given partial quote - '{partialQuote}', below quote is matching with cosine similarity as {matchedQuotes[0][1]} \n '{matchedQuotes[0][0]}'")

For given partial quote - 'Always remember', below quote is matching with cosine similarity as 0.40860901827254376 
 'Always remember that you are absolutely unique Just like everyone else'


Likewise, we can build search, text suggestion, and similar features on top of these vectors.