source: [How to Prepare Text Data for Machine Learning with scikit-learn](https://machinelearningmastery.com/prepare-text-data-machine-learning-scikit-learn/)

After completing the tutorial, you will know:
  - How to convert text to word count vectors with `CountVectorizer`.
  - How to convert text to word frequency vectors with `TfidfVectorizer`.
  - How to convert text to unique integers with `HashingVectorizer`.

In [1]:
import sklearn

In [2]:
print(sklearn.__version__)

0.19.0


## Word Counts with `CountVectorizer`

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

In [4]:
text = ["The quick brown fox jumped over the lazy dog."]
vectorizer = CountVectorizer()


In [10]:
vectorizer.fit(text)
print(vectorizer.vocabulary_)
vector = vectorizer.transform(text)
print(vector.shape)

{'the': 7, 'quick': 6, 'brown': 0, 'fox': 2, 'jumped': 3, 'over': 5, 'lazy': 4, 'dog': 1}
(1, 8)


An encoded vector is returned with a length of the entire vocabulary and an integer count for the number of times each word appeared in the document.  
Because these vectors will contain a lot of zeros, we call them sparse. SciPy provides an efficient way of handling sparse vectors in the `scipy.sparse` package, and `transform` returns one of its matrix types.

In [11]:
print(type(vector))
print(vector.toarray())

<class 'scipy.sparse.csr.csr_matrix'>
[[1 1 1 1 1 1 1 2]]
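
The same fitted vectorizer can encode other documents. The following is a minimal sketch, assuming the one-sentence corpus above; the second document (`"the puppy"`) is made up for illustration. Words that were not seen during `fit` are ignored, so only the slot for `the` is filled.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = ["The quick brown fox jumped over the lazy dog."]
vectorizer = CountVectorizer()
vectorizer.fit(text)

# Hypothetical new document: "the" is in the vocabulary, "puppy" is not.
new_vector = vectorizer.transform(["the puppy"])
new_counts = new_vector.toarray()
print(new_counts)
```

The encoded row has the same length as the vocabulary (8), with a single 1 at index 7, the position of `the`.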


## Word Frequencies with `TfidfVectorizer`

An alternative is to calculate word frequencies, and by far the most popular method is called TF-IDF. This is an acronym that stands for “Term Frequency – Inverse Document Frequency”, which are the components of the resulting scores assigned to each word.
  - **Term Frequency**: This summarizes how often a given word appears within a document.
  - **Inverse Document Frequency**: This downscales words that appear a lot across documents.

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [13]:
text = ["The quick brown fox jumped over the lazy dog.",
        "The dog.",
        "The fox"]
vectorizer = TfidfVectorizer()
vectorizer.fit(text)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [14]:
print(vectorizer.vocabulary_)
print(vectorizer.idf_)

{'the': 7, 'quick': 6, 'brown': 0, 'fox': 2, 'jumped': 3, 'over': 5, 'lazy': 4, 'dog': 1}
[ 1.69314718  1.28768207  1.28768207  1.69314718  1.69314718  1.69314718
  1.69314718  1.        ]
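
These values are consistent with the smoothed IDF formula that scikit-learn documents for the default `smooth_idf=True`: `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, where `n` is the number of documents and `df(t)` is the number of documents containing term `t`. The sketch below recomputes three of them by hand. The token lists are typed out manually (an assumption about how the default tokenizer splits the text), and `smoothed_idf` is a hypothetical helper name.

```python
import math

# The three documents, lowercased and tokenized by hand.
docs = [
    ["the", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog"],
    ["the", "dog"],
    ["the", "fox"],
]
n_docs = len(docs)


def smoothed_idf(term):
    # Document frequency: number of documents that contain the term.
    df = sum(term in doc for doc in docs)
    return math.log((1 + n_docs) / (1 + df)) + 1


for term in ["the", "dog", "quick"]:
    print(term, round(smoothed_idf(term), 8))
```

`the` appears in every document and gets the lowest score (1.0); `quick` appears in only one document and gets the highest (about 1.693).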


In [15]:
# encode document
vector = vectorizer.transform([text[0]])
print(vector.shape)
print(vector.toarray())

(1, 8)
[[ 0.36388646  0.27674503  0.27674503  0.36388646  0.36388646  0.36388646
   0.36388646  0.42983441]]
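
The scores lie between 0 and 1 because the default `norm='l2'` scales each row to unit Euclidean length. As a rough check, the sketch below rebuilds the row by hand: term count times smoothed IDF, then division by the row's L2 norm. The count and document-frequency tables are typed in manually from the three example documents, and the variable names are illustrative only.

```python
import math

# Term counts for the first document; "the" appears twice.
counts = {"brown": 1, "dog": 1, "fox": 1, "jumped": 1,
          "lazy": 1, "over": 1, "quick": 1, "the": 2}
# Number of documents (out of 3) that contain each term.
doc_freq = {"brown": 1, "dog": 2, "fox": 2, "jumped": 1,
            "lazy": 1, "over": 1, "quick": 1, "the": 3}
n_docs = 3

# Unnormalized TF-IDF: count * (ln((1 + n) / (1 + df)) + 1).
raw = {t: counts[t] * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
       for t in counts}

# Scale the row to unit Euclidean length.
l2_norm = math.sqrt(sum(v * v for v in raw.values()))
scores = {t: v / l2_norm for t, v in raw.items()}

for term in sorted(scores):
    print(term, round(scores[term], 8))
```

Printed in alphabetical order, the scores line up with the vocabulary indices 0 to 7 used in the output above.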


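## Hashing with `HashingVectorizer`

Counts and frequencies can be very useful, but one limitation of these methods is that the vocabulary can become very large, which in turn requires large vectors to encode documents.  
A workaround is to use a one-way hash of each word to map it to an integer index, so that no vocabulary has to be stored and the vector length can be fixed in advance. The `HashingVectorizer` class implements this approach: it hashes words consistently, then tokenizes and encodes documents as needed.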
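
To make the idea concrete, here is a toy sketch of hashing words into a fixed number of slots using only the standard library. It is not scikit-learn's implementation: the MD5 hash, the `hash_encode` name, and the plain unsigned counts are all assumptions chosen for illustration.

```python
import hashlib
import re


def hash_encode(document, n_features=20):
    # Lowercase and keep tokens of two or more word characters.
    tokens = re.findall(r"\b\w\w+\b", document.lower())
    vector = [0] * n_features
    for token in tokens:
        # A stable hash of the token picks the slot; no vocabulary is stored.
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        index = int(digest, 16) % n_features
        vector[index] += 1
    return vector


encoded = hash_encode("The quick brown fox jumped over the lazy dog.")
print(encoded)
```

The same word always lands in the same slot, so no fitting step is needed. The trade-off is that different words can collide in one slot, and the mapping cannot be reversed to recover the words.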


In [17]:
from sklearn.feature_extraction.text import HashingVectorizer

In [19]:
text = ["The quick brown fox jumped over the lazy dog."]
# encodes the sample document as a 20-element sparse array.
vectorizer = HashingVectorizer(n_features=20)
vector = vectorizer.transform(text)

In [20]:
print(vector.shape)
print(vector.toarray())

(1, 20)
[[ 0.          0.          0.          0.          0.          0.33333333
   0.         -0.33333333  0.33333333  0.          0.          0.33333333
   0.          0.          0.         -0.33333333  0.          0.
  -0.66666667  0.        ]]
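
The values above are not raw counts. By default `HashingVectorizer` scales each row to unit length (`norm='l2'`) and, as of version 0.19, applies an alternating sign to hashed entries (`alternate_sign=True`), which is why some values are negative. The sketch below, under those assumptions, checks the row length and then switches both options off to get plain non-negative counts.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer

text = ["The quick brown fox jumped over the lazy dog."]

# Default settings: signed values, row scaled to unit Euclidean length.
normalized = HashingVectorizer(n_features=20).transform(text).toarray()
row_length = np.linalg.norm(normalized[0])
print(row_length)

# No normalization and no sign flipping: plain counts per hash slot.
raw_vectorizer = HashingVectorizer(n_features=20, norm=None, alternate_sign=False)
raw_counts = raw_vectorizer.transform(text).toarray()
print(raw_counts)
print(raw_counts.sum())
```

With both options off, the entries sum to 9, the number of tokens in the sentence, even if several words share a slot.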
