### Vectorization
Post initial text preprocessing, we need to transform the text into a meaningful vector of numbers such that a model can perform an operation on the same. There are several techniques to achieve these. Few popular ones are:

1. Bag of Words
2. TF- IDF (Term Frequencey - Inverse Document Frequency)
3. N-Grams Model

In [32]:
import nltk
import string
import pandas as pd
import os

### Bag of Words

a method for representing text data by counting the occurrences of words within a document, disregarding grammar and word order

In [2]:
text = "A kind slave ran away from his cruel master and hid in the forest. There, he saw a lion roaring in pain because a big thorn was stuck in its paw. Even though he was scared, the slave helped the lion by pulling the thorn out. The lion went back into the woods, free and happy. Later, the slave was caught and sent to be punished by being thrown into a lion’s den. But the lion didn’t harm him—it was the same lion he had helped! Moral of the story: A good deed is never forgotten. Be kind, and kindness will come back to you."

let's cleanup the text little bit, ideally we must perform complete preprocessing for better results. However for learning we can just do a bit of it.

In [13]:
text = text.replace(".", "SENTBREAKER")
for p in string.punctuation:
    text = text.replace(p, "")

In [15]:
# let's look into the text
text

'A kind slave ran away from his cruel master and hid in the forestSENTBREAKER There he saw a lion roaring in pain because a big thorn was stuck in its pawSENTBREAKER Even though he was scared the slave helped the lion by pulling the thorn outSENTBREAKER The lion went back into the woods free and happySENTBREAKER Later the slave was caught and sent to be punished by being thrown into a lion’s denSENTBREAKER But the lion didn’t harm him—it was the same lion he had helped Moral of the story A good deed is never forgottenSENTBREAKER Be kind and kindness will come back to youSENTBREAKER'

In [17]:
# let's create a sentences to understand the concept. If we take word as token then we can not derive anything from that.
textSentenceToken = [item.strip() for item in text.split("SENTBREAKER") if item]
textSentenceToken

['A kind slave ran away from his cruel master and hid in the forest',
 'There he saw a lion roaring in pain because a big thorn was stuck in its paw',
 'Even though he was scared the slave helped the lion by pulling the thorn out',
 'The lion went back into the woods free and happy',
 'Later the slave was caught and sent to be punished by being thrown into a lion’s den',
 'But the lion didn’t harm him—it was the same lion he had helped Moral of the story A good deed is never forgotten',
 'Be kind and kindness will come back to you']

let's take the unique words within dataset, to determine the vector for each sentence.

In [19]:
uniqueWords = list(set(text.split()))
len(uniqueWords)

71

In [35]:
# let's look few uniquewords
uniqueWords[:10]

['Later',
 'come',
 'to',
 'Even',
 'slave',
 'hid',
 'never',
 'was',
 'will',
 'kindness']

let's compute the vector by computing the occurrance. 

In [26]:
vectorSpaces = []
# looping through each sentence
for sentence in textSentenceToken:
    vectorSpace = {}
    sentenceWordList = sentence.split()
    for uw in uniqueWords:
        if uw in sentenceWordList:
            if uw not in vectorSpace:
                vectorSpace[uw] = 1
            else:
                vectorSpace[uw] += 1
        else:
            vectorSpace[uw] = 0

    vectorSpaces.append(list(vectorSpace.values()))

In [27]:
# let's look into the vectors
vectorSpaces

[[0,
  0,
  0,
  0,
  1,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  1,
  0,
  1,
  0,
  0,
  0,
  0,
  0,
  1,
  0,
  1,
  0,
  0,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  1,
  0,
  0,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  1,
  0,
  1,
  0,
  0,
  0,
  0,
  1,
  1,
  0,
  0,
  0],
 [0,
  0,
  0,
  0,
  0,
  0,
  0,
  1,
  0,
  0,
  1,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  1,
  0,
  0,
  1,
  1,
  0,
  0,
  0,
  0,
  1,
  0,
  0,
  0,
  0,
  0,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  1,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  1,
  0,
  0,
  0,
  1,
  0,
  0,
  0,
  0,
  1,
  0],
 [0,
  0,
  0,
  1,
  1,
  0,
  0,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  1,
  0,
  0,
  0,
  0,
  1,
  0,
  0,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  1,
  0,
  0,
  0,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  1,
  0,
  0,
  0,
  1,
  0,
  0,
  0,
  0,
  1,
  0,
  0

let's visualize with pandas

In [30]:
df = pd.DataFrame(data=vectorSpaces, columns=uniqueWords,  index=textSentenceToken)

In [34]:
df.to_csv(os.path.join(os.getcwd(), "datafiles", "bag_of_wrods_result.csv") )

Drawback: Vector size grows with the text. High Dimensonality

It does not put the weight on the context.