### Vectorization
After initial text preprocessing, we need to transform the text into a meaningful vector of numbers that a model can operate on. Several techniques can achieve this. A few popular ones are:

1. Bag of Words
2. TF-IDF (Term Frequency - Inverse Document Frequency)
3. N-Grams Model

In [None]:
import nltk
import string
import pandas as pd
import os
import math
import numpy as np

#### N-Grams
N-grams are contiguous sequences of n items (words or characters) from a given text. Common types include:
1. Unigram (1-gram): individual words
2. Bigram (2-gram): pairs of consecutive words
3. Trigram (3-gram): triples of consecutive words, and so on

N-grams help in tasks such as text classification and machine translation, among other machine learning tasks.
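
As a minimal sketch of the definition above, the helper below builds word n-grams in plain Python for n = 1, 2 and 3. The function name `make_ngrams` and the sample sentence are only illustrative choices, not part of any library.

```python
def make_ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    # zip over n staggered views of the token list; zip stops at the
    # shortest view, which yields exactly len(tokens) - n + 1 windows.
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = "the best way out is through".split()

unigrams = make_ngrams(tokens, 1)
bigrams = make_ngrams(tokens, 2)
trigrams = make_ngrams(tokens, 3)

print(bigrams)
# [('the', 'best'), ('best', 'way'), ('way', 'out'), ('out', 'is'), ('is', 'through')]
```

A six-word sentence gives six unigrams, five bigrams and four trigrams; asking for an n larger than the sentence length gives an empty list.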

In [50]:
# let's take the dataset

quotes = ['Be yourself; everyone else is already taken.',
 'The biggest adventure you can take is to live the life of your dreams.',
 'The only thing we have to fear is fear itself.',
 'Some people want it to happen, some wish it would happen, others make it happen.',
 'You have got to be in it to win it.',
 'It does not matter how slowly you go, as long as you do not stop.',
 'Find out who you are and do it on purpose.',
 'For me, becoming is not about arriving somewhere or achieving a certain aim. I see it instead as forward motion, a means of evolving, a way to reach continuously toward a better self. The journey doesn’t end.',
 'Confident people have a way of carrying themselves that makes others attracted to them.',
 'If you can do what you do best and be happy, you are further along in life than most people.',
 'You can be everything. You can be the infinite amount of things that people are.',
 'Always go with your passions. Never ask yourself if it is realistic or not.',
 'When you change your thoughts, remember to also change your world.',
 'The more you know who you are, and what you want, the less you let things upset you.',
 'By being yourself, you put something wonderful in the world that was not there before.',
 'Do one thing every day that scares you.',
 'It is never too late to be what you might have been.',
 'Find out who you are and be that person. That is what your soul was put on this earth to be. Find the truth, live that truth, and everything else will come.',
 'When we are no longer able to change a situation, we are challenged to change ourselves.',
 'If you cannot do great things, do small things in a great way.',
 'Always do your best. What you plant now, you will harvest later.',
 'Get busy living or get busy dying.',
 'In three words I can sum up everything I have learned about life: It goes on.',
 'You can not help what you feel, but you can help how you behave.',
 'No need to hurry. No need to sparkle. No need to be anybody but oneself.',
 'Promise me you will always remember: You are braver than you believe, and stronger than you seem, and smarter than you think.',
 'Failure is a great teacher and, if you are open to it, every mistake has a lesson to offer.',
 'If you do not like the road you are walking, start paving another one.',
 'Do not let yesterday take up too much of today.',
 'Keep smiling, because life is a beautiful thing and there is so much to smile about.',
 'Be persistent and never give up hope.',
 'When we strive to become better than we are, everything around us becomes better too.',
 'Believe and act as if it were impossible to fail.',
 'There are so many great things in life; why dwell on negativity?',
 'Happiness often sneaks in through a door you did not know you left open.',
 'Always remember that you are absolutely unique. Just like everyone else.',
 'Keep your face towards the sunshine and shadows will fall behind you.',
 'A problem is a chance for you to do your best.',
 'You do not always need a plan. Sometimes you just need to breathe, trust, let go and see what happens.',
 'Nothing is impossible. The word itself says "I am possible!"',
 'Life does not have to be perfect to be wonderful.',
 'It is during our darkest moments that we must focus to see the light.',
 'The best way out is through.',
 'do not be afraid to give up the good to go for the great.',
 'Whether you think you can or you can not, you are right.',
 'do not take yourself too seriously. Know when to laugh at yourself, and find a way to laugh at obstacles that inevitably present themselves.',
 'Love the life you live. Live the life you love.',
 'Keep your face towards the sunshine and shadows will fall behind you.',
 'The only person you are destined to become is the person you decide to be.',
 'I am not going to continue knocking that old door that does not open for me. I am going to create my own door and walk through that.',
 'If you change the way you look at things, the things you look at change.',
 'I believe that if you will just stand up and go, life will open up for you. Something just motivates you to keep moving.',
 'Once you face your fear, nothing is ever as hard as you think.']

In [4]:
import nltk
from nltk.util import ngrams
from collections import Counter

# Download necessary NLTK resources (if not already downloaded)
# nltk.download('punkt')

text_data = "this is a very good book to study"
# words = nltk.word_tokenize(text_data)  # Tokenize the text
words = text_data.split()

# Generate bigrams
bigrams_nltk = list(ngrams(words, 2))
print(f"NLTK Bigrams: {bigrams_nltk}")



NLTK Bigrams: [('this', 'is'), ('is', 'a'), ('a', 'very'), ('very', 'good'), ('good', 'book'), ('book', 'to'), ('to', 'study')]


In [5]:
# Count frequency of bigrams
bigram_freq = Counter(bigrams_nltk)
print(f"Bigram Frequencies: {bigram_freq}")  # every bigram occurs once in this sentence

Bigram Frequencies: Counter({('this', 'is'): 1, ('is', 'a'): 1, ('a', 'very'): 1, ('very', 'good'): 1, ('good', 'book'): 1, ('book', 'to'): 1, ('to', 'study'): 1})


https://towardsdatascience.com/understanding-word-n-grams-and-n-gram-probability-in-natural-language-processing-9d9eef0fa058/

Let's combine all the quotes and create n-grams.

In [51]:
allQuotes = " ".join(quotes).lower()
for p in string.punctuation:
    allQuotes = allQuotes.replace(p, "")

print(allQuotes)

be yourself everyone else is already taken the biggest adventure you can take is to live the life of your dreams the only thing we have to fear is fear itself some people want it to happen some wish it would happen others make it happen you have got to be in it to win it it does not matter how slowly you go as long as you do not stop find out who you are and do it on purpose for me becoming is not about arriving somewhere or achieving a certain aim i see it instead as forward motion a means of evolving a way to reach continuously toward a better self the journey doesn’t end confident people have a way of carrying themselves that makes others attracted to them if you can do what you do best and be happy you are further along in life than most people you can be everything you can be the infinite amount of things that people are always go with your passions never ask yourself if it is realistic or not when you change your thoughts remember to also change your world the more you know who y

In [65]:
def create_grams(datasetList, n):
    """Return every n-gram in datasetList as a space-joined string."""
    ngramslist = []
    # a list of length L holds L - n + 1 windows of size n
    for i in range(len(datasetList) - n + 1):
        ngramslist.append(" ".join(datasetList[i:i + n]))
    return ngramslist


In [66]:
allQuotesList = allQuotes.split()

In [67]:

ngram_tokens = create_grams(allQuotesList, 2)
ngram_tokens[:5]

['be yourself', 'yourself everyone', 'everyone else', 'else is', 'is already']

Now we will compute a probability for each n-gram in ngram_tokens.

Example: ['be', 'yourself']

Here we compute the probability of 'yourself' given 'be':

P(yourself | be) = count(be yourself) / count(be)
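
The conditional probability of a word given the previous word can be estimated from two counts: how often the pair occurs, and how often the first word occurs. The sketch below works this out on a tiny made-up sentence using `collections.Counter`. The sentence and the helper name `bigram_probability` are assumptions for illustration only.

```python
from collections import Counter

# Tiny made-up corpus, already lowercased and tokenized
tokens = "do not stop do not quit do it".split()

unigram_counts = Counter(tokens)                  # count(word)
bigram_counts = Counter(zip(tokens, tokens[1:]))  # count(prev_word, next_word)


def bigram_probability(prev_word, next_word):
    """Estimate P(next_word | prev_word) as count(prev next) / count(prev)."""
    if unigram_counts[prev_word] == 0:
        return 0.0  # unseen previous word: no estimate available
    return bigram_counts[(prev_word, next_word)] / unigram_counts[prev_word]


# 'do' occurs 3 times and 'do not' occurs 2 times, so P(not | do) = 2/3
print(bigram_probability("do", "not"))
```

Counting once with `Counter` avoids rescanning the whole token list for every bigram, which matters as the corpus grows.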

In [90]:
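n_grams = []
for token in ngram_tokens: 
    # how often this exact bigram occurs
    count_of_text = ngram_tokens.count(token)
    # how often the preceding word occurs; this assumes bigrams, because for
    # n > 2 the joined prefix is not a single token in allQuotesList
    count_of_previous_word = allQuotesList.count(" ".join(token.split()[:-1])) 

    _token_value = [token, count_of_text/count_of_previous_word]

    # keep each bigram only once
    if _token_value not in n_grams:
        n_grams.append(_token_value)

n_grams[:5]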

[['be yourself', 0.07142857142857142],
 ['yourself everyone', 0.2],
 ['everyone else', 1.0],
 ['else is', 0.3333333333333333],
 ['is already', 0.0625]]

In [56]:
import heapq


In [73]:
heapq.heapify(n_grams)
n_grams[:5]

[['a beautiful', 0.06666666666666667],
 ['a certain', 0.06666666666666667],
 ['a better', 0.06666666666666667],
 ['a chance', 0.06666666666666667],
 ['a way', 0.2]]
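
Note that `heapq.heapify` only rearranges the list so that the smallest item sits at index 0; the rest of the list is in heap order, not sorted order, which is why the slice above is not alphabetical. Because each item is a `[text, probability]` list, items are compared by text first. A small sketch with made-up pairs shows the difference between heapifying, sorting, and asking for the largest probabilities:

```python
import heapq

# Made-up [bigram, probability] pairs in the same shape as n_grams
pairs = [["b c", 0.5], ["a b", 0.2], ["c d", 1.0], ["a c", 0.8]]

heap = list(pairs)   # copy so the original order is kept
heapq.heapify(heap)  # lists compare element by element, so text comes first
smallest = heap[0]   # only index 0 is guaranteed to be the minimum

by_text = sorted(pairs)                                 # fully sorted by text, then value
top_two = heapq.nlargest(2, pairs, key=lambda p: p[1])  # highest probabilities

print(smallest)  # ['a b', 0.2]
print(top_two)   # [['c d', 1.0], ['a c', 0.8]]
```

`heapq.nlargest` accepts any iterable, so calling `heapify` first is not required to get the top items.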

In [83]:
heapq.nlargest(5, n_grams, key=lambda x: x[1])

[['achieving a', 1.0],
 ['afraid to', 1.0],
 ['adventure you', 1.0],
 ['aim i', 1.0],
 ['already taken', 1.0]]

In [88]:
# return up to 5 suggestions for the next possible word
def get_next_word(current_word: str): 
    # bigrams whose first word matches current_word
    suggestions = [item for item in n_grams if item[0].split()[0] == current_word]
    heapq.heapify(suggestions)

    possible_suggestions = heapq.nlargest(5, suggestions, key=lambda x: x[1])

    # each suggestion is the full bigram, e.g. 'not about'
    return [item[0] for item in possible_suggestions]
        

In [106]:
print(get_next_word("not"))

['not about', 'not always', 'not be', 'not know', 'not have']


Let's try to complete a sentence.

In [112]:
test_text = "people thinks that"

In [97]:
import random

In [118]:
new_text_list = test_text.split()
do_predict = True
last_word = new_text_list[-1]
 
while do_predict: 
    suggested_word = get_next_word(last_word)
     
    if suggested_word:
        last_word = random.choice([item.split()[1] for item in suggested_word])
        new_text_list.append(last_word)

        # stop once the text is longer than 10 words
        if len(new_text_list) > 10:
            do_predict = False
    else:
        do_predict = False


print("Actual text:: ", test_text)
print("Formed text:: ", " ".join(new_text_list))


Actual text::  people thinks that
Formed text::  people thinks that does not be afraid to also change i
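
The loop above picks uniformly at random among the top five suggestions. One possible variant, sketched here under the assumption that the model is a list of `[bigram, probability]` pairs like `n_grams`, samples the next word in proportion to its probability using `random.choices`. The names `toy_n_grams` and `sample_next_word` are hypothetical.

```python
import random

# Hypothetical toy model in the same [bigram, probability] shape as n_grams
toy_n_grams = [["not stop", 0.5], ["not quit", 0.25], ["not be", 0.25], ["be happy", 1.0]]


def sample_next_word(current_word, model):
    """Pick a next word with probability proportional to its bigram score."""
    candidates = [(text.split()[1], prob) for text, prob in model
                  if text.split()[0] == current_word]
    if not candidates:
        return None  # no bigram starts with this word
    words, weights = zip(*candidates)
    return random.choices(words, weights=weights, k=1)[0]


print(sample_next_word("not", toy_n_grams))  # one of: stop, quit, be
print(sample_next_word("be", toy_n_grams))   # happy
```

Weighted sampling favours frequent continuations while still allowing rarer ones, but each word still depends only on the single word before it.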


As we can see, sentences are being formed, but they lack context, because each word is chosen from the previous word alone. This is why we use vectorization: to add more context.