# How to Prepare Text Data with scikit-learn
* The text must first be parsed to split it into words, a step called **tokenization**.
* The words then need to be encoded as integers or floating-point values so they can be used as input to a machine learning algorithm, a step called **feature extraction (vectorization)**.
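
As a rough illustration of the tokenization step, here is a minimal sketch using only the standard library. It assumes the same lowercasing and token pattern that `CountVectorizer` uses by default (`r"(?u)\b\w\w+\b"`, i.e. words of two or more characters); the `tokenize` helper itself is a hypothetical name, not part of scikit-learn.

```python
import re

# Assumption: this mirrors CountVectorizer's default behaviour
# (lowercasing plus the token pattern below), which keeps only
# tokens made of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(document):
    """Lowercase a document and split it into word tokens."""
    return TOKEN_PATTERN.findall(document.lower())


tokens = tokenize("The quick brown fox jumped over the lazy dog.")
print(tokens)
```

Punctuation is dropped and single-character words such as "a" are ignored under this pattern.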

## Bag of Words Model
* The model is simple: it discards all word-order information and focuses only on the occurrence of words in a document.
* This can be done by assigning each word a unique number.
* Then any document we see can be encoded as a fixed-length vector with the length of the vocabulary of known words.
* The value in each position in the vector could be filled with a count or frequency of each word in the encoded document.
* This is the bag-of-words model, where we are only concerned with encoding schemes that represent what words are present or the degree to which they are present in encoded documents, without any information about order.
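
The steps in the list above can be sketched in plain Python. This is only an illustration of the idea, not scikit-learn's implementation; the helper names `tokenize`, `build_vocabulary` and `encode` are hypothetical, and the alphabetical index order is an assumption chosen for readability.

```python
import re
from collections import Counter


def tokenize(document):
    # Hypothetical helper: lowercase, then keep runs of 2+ word characters.
    return re.findall(r"\b\w\w+\b", document.lower())


def build_vocabulary(documents):
    """Assign each distinct word a unique integer, in alphabetical order."""
    words = sorted({word for doc in documents for word in tokenize(doc)})
    return {word: index for index, word in enumerate(words)}


def encode(document, vocabulary):
    """Return a count vector with one position per vocabulary word."""
    counts = Counter(tokenize(document))
    vector = [0] * len(vocabulary)
    for word, count in counts.items():
        if word in vocabulary:  # words outside the vocabulary are skipped
            vector[vocabulary[word]] = count
    return vector


docs = ["The quick brown fox jumped over the lazy dog"]
vocab = build_vocabulary(docs)
print(vocab)
print(encode(docs[0], vocab))      # "the" appears twice
print(encode("the puppy", vocab))  # "puppy" is unknown and ignored
```

Every document is mapped to a vector of the same length, and word order plays no part in the result.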

## 1. Word Counts with CountVectorizer


In [1]:
from sklearn.feature_extraction.text import CountVectorizer

# list of text documents
text = ["The quick brown fox jumped over the lazy dog"]

# create the transform
vectorizer = CountVectorizer()

# tokenize and build vocab
vectorizer.fit(text)

# summarize
print(vectorizer.vocabulary_)

# encode document
vector = vectorizer.transform(text)

# summarize encoded vector
print(vector.shape)
print(type(vector))
print(vector.toarray())

{'the': 7, 'quick': 6, 'brown': 0, 'fox': 2, 'jumped': 3, 'over': 5, 'lazy': 4, 'dog': 1}
(1, 8)
<class 'scipy.sparse.csr.csr_matrix'>
[[1 1 1 1 1 1 1 2]]


In [2]:
# Importantly, the same vectorizer can be used on documents that contain words not in the vocabulary.
# Those words are ignored and no count is recorded for them.

# encode another document
text2 = ["the puppy"]
vector = vectorizer.transform(text2)
print(vector.toarray())

[[0 0 0 0 0 0 0 1]]
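
The encoded vectors are easier to read once each column is mapped back to its word. A minimal sketch, assuming the vocabulary and the first encoded vector are copied by hand from the printed output of the first cell in this section (in a real session they would come from `vectorizer.vocabulary_` and `vector.toarray()`):

```python
# Vocabulary and encoded vector copied from the printed output.
vocabulary = {'the': 7, 'quick': 6, 'brown': 0, 'fox': 2,
              'jumped': 3, 'over': 5, 'lazy': 4, 'dog': 1}
encoded = [1, 1, 1, 1, 1, 1, 1, 2]

# Invert the word -> index mapping so columns can be read as words.
index_to_word = {index: word for word, index in vocabulary.items()}

word_counts = {index_to_word[i]: count
               for i, count in enumerate(encoded) if count > 0}
print(word_counts)
```

The last column (index 7) belongs to "the", which is why it holds the count 2.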


## 2. Word Frequencies with TfidfVectorizer
* **TF-IDF** stands for Term Frequency-Inverse Document Frequency; the two components below are multiplied to give each word's score.
* **Term Frequency** - This summarizes how often a given word appears within a document.
* **Inverse Document Frequency** - This downscales words that appear a lot across documents.

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer

# list of text documents
text = ["The quick brown fox jumped over the lazy dog.","The dog","The fox"]

# create the transform
vectorizer = TfidfVectorizer()

# tokenize and build vocab
vectorizer.fit(text)

# summarize
print(vectorizer.vocabulary_)
print(vectorizer.idf_)

# encode document
vector = vectorizer.transform([text[0]])

# summarize encoded vector
print(vector.shape)
print(vector.toarray())

{'the': 7, 'quick': 6, 'brown': 0, 'fox': 2, 'jumped': 3, 'over': 5, 'lazy': 4, 'dog': 1}
[1.69314718 1.28768207 1.28768207 1.69314718 1.69314718 1.69314718
 1.69314718 1.        ]
(1, 8)
[[0.36388646 0.27674503 0.27674503 0.36388646 0.36388646 0.36388646
  0.36388646 0.42983441]]
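
The printed `idf_` values and the encoded vector can be reproduced by hand. The sketch below assumes scikit-learn's default settings for `TfidfVectorizer`: the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` and L2 normalisation of each document vector. The document frequencies are counted manually from the three example documents, and `smoothed_idf` is a hypothetical helper name.

```python
import math

# Document frequencies for the three example documents:
# "the" appears in all 3, "dog" and "fox" in 2, every other word in 1.
n_documents = 3
document_frequency = {"brown": 1, "dog": 2, "fox": 2, "jumped": 1,
                      "lazy": 1, "over": 1, "quick": 1, "the": 3}


def smoothed_idf(df, n):
    # Assumed formula (scikit-learn default, smooth_idf=True).
    return math.log((1 + n) / (1 + df)) + 1


idf = {word: smoothed_idf(df, n_documents)
       for word, df in document_frequency.items()}

# Term counts in the first document; "the" occurs twice.
term_counts = {word: 1 for word in document_frequency}
term_counts["the"] = 2

# Raw tf-idf, then L2 normalisation so the vector has unit length.
raw = {word: term_counts[word] * idf[word] for word in idf}
norm = math.sqrt(sum(value ** 2 for value in raw.values()))
tfidf = {word: value / norm for word, value in raw.items()}

for word in sorted(tfidf):
    print(f"{word:7s} idf={idf[word]:.8f} tfidf={tfidf[word]:.8f}")
```

"the" appears in every document, so it gets the lowest possible IDF of 1.0, yet it still has the largest final score in the first document because it occurs twice there.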


## 3. Hashing with HashingVectorizer
* Count and frequency vectors grow with the vocabulary, which can become very large.
* A clever workaround is to use a one-way hash of words to convert them to integers. The clever part is that no vocabulary is required and you can choose a fixed-length vector of any size you like.
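
To make the idea concrete, here is a minimal sketch of hashing words into a fixed number of buckets. It is not how `HashingVectorizer` is implemented: MD5 from the standard library stands in for the hash function (scikit-learn uses a different one and additionally signs and normalises the values), and `hashed_counts` is a hypothetical helper.

```python
import hashlib
import re


def hashed_counts(document, n_features=20):
    """Count tokens into a fixed number of buckets chosen by a hash."""
    vector = [0] * n_features
    for token in re.findall(r"\b\w\w+\b", document.lower()):
        # Assumption: MD5 is used only because it is stable and built in.
        # The digest is read as a big integer and reduced modulo the
        # vector length to pick a bucket.
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        index = int(digest, 16) % n_features
        vector[index] += 1  # different words may collide in one bucket
    return vector


vector = hashed_counts("The quick brown fox jumped over the lazy dog.")
print(len(vector))  # always n_features, whatever the text contains
print(sum(vector))  # one increment per token
```

No `fit` step is needed because nothing is learned from the data. The trade-off is that distinct words can share a bucket, and a bucket index cannot be turned back into a word.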

In [2]:
from sklearn.feature_extraction.text import HashingVectorizer

# list of text documents
text = ["The quick brown fox jumped over the lazy dog."]

# create the transform
vectorizer = HashingVectorizer(n_features=20)

# encode document
vector = vectorizer.transform(text)

# summarize encoded vector
print(vector.shape)
print(vector.toarray())

(1, 20)
[[ 0.          0.          0.          0.          0.          0.33333333
   0.         -0.33333333  0.33333333  0.          0.          0.33333333
   0.          0.          0.         -0.33333333  0.          0.
  -0.66666667  0.        ]]
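
The output contains signed fractions rather than whole counts. With default settings, `HashingVectorizer` gives each hashed token a sign (`alternate_sign=True`) and scales each document vector to unit length (`norm='l2'`). A small check on the non-zero values, copied by hand from the printed vector above:

```python
import math

# Non-zero entries copied from the printed vector.
values = [0.33333333, -0.33333333, 0.33333333, 0.33333333,
          -0.33333333, -0.66666667]

# The L2 norm of a normalised vector should be 1.
l2_norm = math.sqrt(sum(v * v for v in values))
print(round(l2_norm, 6))

# Multiplying by 3 (the norm before scaling, assuming integer bucket
# totals) recovers the signed totals per bucket.
signed_totals = [round(v * 3) for v in values]
print(signed_totals)
```

Because signs can cancel when words collide in a bucket, the absolute totals need not add up to the number of tokens in the document.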


## Summary
* You discovered how to prepare text documents for machine learning with scikit-learn.
    * How to convert text to word count vectors with CountVectorizer.
    * 