# HashingVectorizer
The CountVectorizer and TfidfVectorizer schemes are simple, but they hold an in-memory mapping from the string tokens to the integer feature indices (the vocabulary_ attribute), which causes several problems when dealing with large datasets:
<ol>
<li> The larger the corpus, the larger the vocabulary grows, and hence the memory use too.
<li> Fitting requires the allocation of intermediate data structures of size proportional to that of the original dataset.
<li> Building the word mapping requires a full pass over the dataset, hence it is not possible to fit text classifiers in a strictly online manner.
<li> Pickling and un-pickling vectorizers with a large vocabulary_ can be very slow (typically much slower than pickling / un-pickling flat data structures such as a NumPy array of the same size).
<li> It is not easy to split the vectorization work into concurrent sub-tasks, because the vocabulary_ attribute would have to be shared state with a fine-grained synchronization barrier: the mapping from token string to feature index depends on the order of the first occurrence of each token, so it would have to be shared, potentially harming the concurrent workers’ performance to the point of making them slower than the sequential variant.
</ol>
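The vocabulary growth described in the first point can be observed directly. The following is a minimal sketch, assuming a toy corpus (the third sentence is made up for illustration) and the default CountVectorizer settings:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus; the last sentence is a made-up addition for illustration
corpus = ["the cat is on the table",
          "the dog is on the table",
          "a bird sits on the roof"]

# Fit on the first two documents only
cv_small = CountVectorizer().fit(corpus[:2])
# Fit on all three documents
cv_full = CountVectorizer().fit(corpus)

# vocabulary_ is a dict mapping each token string to its column index
print(len(cv_small.vocabulary_), len(cv_full.vocabulary_))
print(cv_full.vocabulary_)
```

Every new distinct token adds an entry to vocabulary_, so the mapping keeps growing with the corpus. Note that the default tokenizer ignores single-character tokens such as "a".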
HashingVectorizer is stateless, meaning that you don’t have to call <b>fit</b> on it:

In [None]:
from sklearn.feature_extraction.text import HashingVectorizer
text = ["the cat is on the table",
        "the dog is on the table"]
vectorizer = HashingVectorizer(n_features=10)
vector = vectorizer.transform(text)
vector.toarray()

array([[ 0.        , -0.31622777,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.31622777, -0.63245553,  0.63245553],
       [ 0.        , -0.31622777,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.31622777, -0.63245553,  0.63245553]])

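The number of non-zero values in each row (4) is smaller than the number of distinct words in each document (5). This is due to collisions in the hash function: with only 10 features, different tokens can be mapped to the same column. The two rows are even identical here, which means that "cat" and "dog" ended up in the same column. The negative values come from the signed hashing that HashingVectorizer applies by default (alternate_sign=True), and each row is L2-normalized (norm='l2').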
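To see how the number of features affects collisions, the same two sentences can be hashed into a small and a large feature space. This is a minimal sketch, assuming that normalization and sign alternation are switched off (norm=None, alternate_sign=False) so that the matrix holds raw token counts; the variable names are chosen for illustration only:

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer

text = ["the cat is on the table",
        "the dog is on the table"]

# norm=None and alternate_sign=False keep raw, unsigned token counts per column
small = HashingVectorizer(n_features=10, norm=None, alternate_sign=False)
large = HashingVectorizer(n_features=2**20, norm=None, alternate_sign=False)

X_small = small.transform(text)  # no call to fit is needed
X_large = large.transform(text)

# For a CSR matrix, the differences of indptr give the stored entries per row
nnz_small = np.diff(X_small.indptr)
nnz_large = np.diff(X_large.indptr)
print("occupied columns per document, n_features=10:   ", nnz_small)
print("occupied columns per document, n_features=2**20:", nnz_large)

# Collisions merge counts but never lose them: each row still sums to
# the 6 tokens of its sentence
row_sums = np.asarray(X_small.sum(axis=1)).ravel()
print("tokens per document:", row_sums)
```

With a large n_features, collisions become unlikely and each distinct word usually gets its own column, at the cost of a wider (but still sparse) matrix.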