## Objective 01 - Represent a document as a vector


### Challenge
For this challenge, create your own corpus (a collection of sentences). You can start from randomly generated sentences and adjust them so that some words are shared by several or all of the sentences. From this corpus, create a count vector, a one-hot encoded vector, and a tf-idf vector. Do the values make sense for each document?

In [14]:
# Create the corpus with random sentences

corpus1 = ["Two things are infinite: the universe and human stupidity; and I'm not sure about the universe.",
          "Be who you are and say what you feel, because those who mind don't matter, and those who matter don't mind.",
          "A room without books is like a body without a soul.",
          "Be the change that you wish to see in the world."
         ]

In [15]:
# Frequency-count

# Import the feature_extraction module and vectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Instantiate the object and count the words
vectorizer1 = CountVectorizer()
vectors1 = vectorizer1.fit_transform(corpus1)

# Convert the sparse matrix to a dense one (zeros included) and print it
print(vectors1.todense())

[[1 2 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 1 1 0 2 1 0 0 1 2 0 0 0 0 0
  0]
 [0 2 1 1 1 0 0 0 2 1 0 0 0 0 0 2 2 0 0 1 0 0 0 0 0 0 0 2 0 0 0 1 3 0 0 0
  2]
 [0 0 0 0 0 1 1 0 0 0 0 0 0 1 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 2 0
  0]
 [0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 2 0 0 1 0 0 0 0 1 0 1
  1]]


In [16]:
# One-hot encoding of word counts

# Import the binary encoder
from sklearn.preprocessing import Binarizer

# Initialize the vectorizer and get the word counts
freq1 = CountVectorizer()
corpus_freq1 = freq1.fit_transform(corpus1)

# Initialize the binarizer and create the binary encoded vector
onehot1 = Binarizer()
corpus_onehot1 = onehot1.fit_transform(corpus_freq1.toarray())

# Display the one-hot encoded vector
corpus_onehot1

array([[1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
        1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0],
       [0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0,
        0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
        0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1]])
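
`CountVectorizer` also accepts `binary=True`, which records only whether a word is present and so avoids the separate `Binarizer` step. The sketch below compares the two routes on a hypothetical toy corpus (`toy_corpus` is not part of the challenge data) and should be read as an illustration rather than the required approach.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import Binarizer

# Hypothetical two-sentence corpus, used only for illustration
toy_corpus = ["the cat sat on the mat", "the dog sat"]

# Route 1: count the words, then set every count above zero to 1
counts = CountVectorizer().fit_transform(toy_corpus)
via_binarizer = Binarizer().fit_transform(counts.toarray())

# Route 2: let the vectorizer emit 0/1 indicators directly
via_flag = CountVectorizer(binary=True).fit_transform(toy_corpus).toarray()

# Both routes should give the same presence/absence matrix
print(np.array_equal(via_binarizer, via_flag))
```

Either way, a repeated word such as "the" in the first sentence is recorded as 1 rather than 2, which is the only difference from the count vectors.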

In [17]:
# Tf-idf

# Import libraries and modules
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Instantiate vectorizer object
tfidf1 = TfidfVectorizer(stop_words='english', max_features=5000)

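# Learn the vocabulary and compute tf-idf weights for each document
dtm1 = tfidf1.fit_transform(corpus1)

# Build a DataFrame with the feature names as column headers
# (get_feature_names_out requires scikit-learn 1.0+; older versions use get_feature_names)
dtm1 = pd.DataFrame(dtm1.todense(), columns=tfidf1.get_feature_names_out())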

# View feature matrix as DataFrame
dtm1.head()

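| | angry | cereal | chameleon | chose | color | fruit | hated | karma | killer | loops | loved | paintbrush | stomped | use |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.12663 | 0.0 | 0.0 | 0.0 | 0.0 | 0.99195 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.430037 | 0.0 | 0.274487 | 0.430037 | 0.430037 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.430037 | 0.0 | 0.430037 |
| 2 | 0.0 | 0.366739 | 0.0 | 0.0 | 0.0 | 0.465162 | 0.0 | 0.0 | 0.465162 | 0.465162 | 0.0 | 0.0 | 0.465162 | 0.0 |
| 3 | 0.0 | 0.321093 | 0.259952 | 0.0 | 0.0 | 0.0 | 0.81453 | 0.0 | 0.0 | 0.0 | 0.407265 | 0.0 | 0.0 | 0.0 |

These columns (angry, cereal, chameleon, ...) do not occur in `corpus1`, so this table reflects a different corpus; fitting on `corpus1` produces columns drawn from the four quotes defined at the top of the challenge.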
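
To see where tf-idf values come from, the weights can be rebuilt by hand and compared with the library. This is a sketch under stated assumptions: it uses the default `TfidfVectorizer` settings as documented by scikit-learn (raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, L2-normalized rows), and `toy_corpus` is a hypothetical corpus introduced only for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical two-sentence corpus, used only for illustration
toy_corpus = ["the cat sat on the mat", "the dog sat"]

# Term frequencies: raw word counts per document
counts = CountVectorizer().fit_transform(toy_corpus).toarray()
n_docs = counts.shape[0]

# Document frequency: number of documents that contain each word
doc_freq = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (scikit-learn's default form)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts by idf, then scale each row to unit length
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Compare against the library's result
reference = TfidfVectorizer().fit_transform(toy_corpus).toarray()
print(np.allclose(manual, reference))
```

Words that appear in fewer documents receive a larger idf, and because each row is scaled to unit length, every value lies between 0 and 1, as in the table above.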
