### Scalability Issues

The `sklearn.feature_extraction.text.CountVectorizer` and `sklearn.feature_extraction.text.TfidfVectorizer` classes suffer from several scalability issues, all of which stem from their internal use of the `vocabulary_` attribute.

The main scalability issues are:

* **Memory usage of the text vectorizer**
* **Parallelization problems for text feature extraction** 
* **No support for online or out-of-core / streaming learning**

In [1]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(min_df=1)

vectorizer.fit([
        "The cat sat on the mat.",
    ])
vectorizer.vocabulary_

{u'cat': 0, u'mat': 1, u'on': 2, u'sat': 3, u'the': 4}

In [2]:
# The vocabulary is used in transform to build the occurrence matrix

X = vectorizer.transform([
        'The cat sat on the mat.',
        'This cat is a nice cat.',
    ]).toarray()

print(len(vectorizer.vocabulary_))
print(vectorizer.get_feature_names())
print(X)

5
[u'cat', u'mat', u'on', u'sat', u'the']
[[1 1 1 1 2]
 [2 0 0 0 0]]
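
To make the role of the vocabulary concrete, here is a minimal pure-Python sketch of what `transform` conceptually does with it. This is not scikit-learn's actual implementation (which builds a sparse matrix); `count_with_vocabulary` is a hypothetical helper, and the regular expression is assumed to match the vectorizer's default tokenization (lowercased runs of two or more word characters).

```python
import re

# Assumed equivalent of the default tokenization: two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def count_with_vocabulary(documents, vocabulary):
    """Hypothetical helper: build dense count rows from a fixed vocabulary."""
    rows = []
    for doc in documents:
        row = [0] * len(vocabulary)
        for token in TOKEN_PATTERN.findall(doc.lower()):
            column = vocabulary.get(token)
            if column is not None:  # tokens outside the vocabulary are dropped
                row[column] += 1
        rows.append(row)
    return rows


# The vocabulary learned in In [1].
vocabulary = {"cat": 0, "mat": 1, "on": 2, "sat": 3, "the": 4}
rows = count_with_vocabulary(
    ["The cat sat on the mat.", "This cat is a nice cat."],
    vocabulary,
)
print(rows)  # [[1, 1, 1, 1, 2], [2, 0, 0, 0, 0]]
```

Every lookup goes through the vocabulary mapping, so the whole mapping has to be held in memory for as long as the vectorizer is in use, and words it has never seen ("this", "is", "nice") are silently ignored.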


In [3]:
# refit with a larger corpus

vectorizer = CountVectorizer(min_df=1)

vectorizer.fit([
        'The cat sat on the math.', 
        'The quick brown fox jumps over the lazy dog.',
    ])
vectorizer.vocabulary_

{u'brown': 0,
 u'cat': 1,
 u'dog': 2,
 u'fox': 3,
 u'jumps': 4,
 u'lazy': 5,
 u'math': 6,
 u'on': 7,
 u'over': 8,
 u'quick': 9,
 u'sat': 10,
 u'the': 11}

The `vocabulary_` grows with the size of the training corpus: more slowly than the corpus itself (sublinearly), but new documents keep introducing new words.

With this new vocabulary, the dimensionality of the output space is now larger.

In [4]:
X = vectorizer.transform([
    "The cat sat on the mat.",
    "This cat is a nice cat.",
]).toarray()

print(len(vectorizer.vocabulary_))
print(vectorizer.get_feature_names())
print(X)

12
[u'brown', u'cat', u'dog', u'fox', u'jumps', u'lazy', u'math', u'on', u'over', u'quick', u'sat', u'the']
[[0 1 0 0 0 0 0 1 0 0 1 2]
 [0 2 0 0 0 0 0 0 0 0 0 0]]
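
The same dependence on the fitted vocabulary is what makes parallel feature extraction awkward. The sketch below uses a hypothetical `build_vocabulary` helper as a simplified stand-in for `fit`; it is not the library's code. It assumes the default tokenization and assigns indices in sorted order, mirroring the alphabetical indices visible in the outputs above. It shows that two workers fitting on separate partitions of a corpus disagree on the column index of a shared word.

```python
import re

# Assumed equivalent of the default tokenization: two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def build_vocabulary(documents):
    """Hypothetical stand-in for fit: map each distinct token to a column index."""
    tokens = set()
    for doc in documents:
        tokens.update(TOKEN_PATTERN.findall(doc.lower()))
    return {token: index for index, token in enumerate(sorted(tokens))}


# Two workers, each seeing only its own partition of the corpus.
vocab_a = build_vocabulary(["The cat sat on the mat."])
vocab_b = build_vocabulary(["The quick brown fox jumps over the lazy dog."])

# Words known to both workers but mapped to different columns.
shared = sorted(set(vocab_a) & set(vocab_b))
conflicts = {t: (vocab_a[t], vocab_b[t]) for t in shared if vocab_a[t] != vocab_b[t]}

print(len(vocab_a), len(vocab_b))  # 5 8
print(conflicts)                   # {'the': (4, 7)}
```

Because the column for "the" depends on which other words each worker has seen, the partial matrices cannot simply be stacked. The workers would first have to agree on a single shared vocabulary, which requires a synchronization step over the whole corpus.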
