# Big Data Strategies

Often you'll want to take advantage of more computing power, or you may have data with too many features, too many samples, or both to fit in memory.

Scikit-learn provides a few options for these cases.

## Streaming Instances

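Many scikit-learn estimators provide a `partial_fit` method, which updates the model incrementally. This lets you read samples into memory one at a time or in batches. Pseudo-code for this might look something like:

```python
from itertools import islice

estimator = EstimatorPipeline()  # placeholder: any estimator with partial_fit

with open('data') as file_handle:
    while True:
        # read the next N_LINES lines; an empty list means end of file
        chunk = list(islice(file_handle, N_LINES))
        if not chunk:
            break
        X, y = create_matrices(chunk)  # placeholder parsing helper
        estimator.partial_fit(X, y)
```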

## Feature Extraction

Recall the bag of words representation for text data.

In [None]:
from IPython.display import SVG

SVG("images/bag_of_words.svg")

This representation requires a dictionary of *all* of the words that you may encounter. When the set of features is not known in advance, as is often the case with text data, you can extract it by making a full pass over the data. If that isn't feasible, you can turn to the *hashing trick*.

In [None]:
SVG("images/hashing_vectorizer.svg")

### Aside: Hash Functions 

A hash function maps data of arbitrary size to values of a fixed size.

In [None]:
SVG("https://upload.wikimedia.org/wikipedia/commons/7/71/Hash_table_4_1_1_0_0_0_0_LL.svg")

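Python has a built-in function for this called `hash`. (You will **not** see the same numbers from one session to the next. Since Python 3.3, string hashing is randomized per process for security. Within a single session, the same string always hashes to the same value.)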

In [None]:
hash('a')

In [None]:
hash('cuckoo')

In [None]:
hash('cuckoo')

Scikit-learn provides a fast hash function, MurmurHash3, whose output does not change between sessions.

In [None]:
from sklearn.utils.murmurhash import murmurhash3_bytes_u32

In [None]:
for word in "cuckoo goes the cuckoo clock".encode("utf-8").split():
    word_hash = murmurhash3_bytes_u32(word, 0) % 2 ** 20
    print(f"{word} {word_hash}")

We now have a stateless mapping from words to column indices, and the dimensionality of the output space is fixed in advance. Here `2 ** 20` gives 1,048,576 (roughly 1M) features.
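
The same idea can be sketched with the standard library alone. This is a minimal illustration, not how scikit-learn implements it: `zlib.crc32` stands in for MurmurHash3 (an assumption, chosen because it gives the same result on every run), and `hashed_counts` and `N_FEATURES` are hypothetical names.

```python
import zlib

N_FEATURES = 16  # deliberately tiny so the vector is easy to inspect


def hashed_counts(text, n_features=N_FEATURES):
    """Count tokens into a fixed-length vector indexed by a hash of each token."""
    counts = [0] * n_features
    for token in text.lower().split():
        # crc32 is deterministic, so a token always maps to the same column
        index = zlib.crc32(token.encode("utf-8")) % n_features
        counts[index] += 1
    return counts


vector = hashed_counts("cuckoo goes the cuckoo clock")
print(vector)
```

No vocabulary is stored anywhere: the length of the vector depends only on `n_features`, and both occurrences of "cuckoo" land in the same column.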

But wait: if you're mapping from a potentially unbounded domain to a fixed range, won't you have *collisions*? In principle, yes. In practice, rarely, as long as the number of features is large relative to the vocabulary. YMMV.
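
To get a feel for how often collisions happen, here is a rough sketch that counts them for synthetic tokens and compares the count with what uniform random bucket assignment would predict. The tokens are made up, `zlib.crc32` is again a stand-in hash, and the helper names are hypothetical.

```python
import zlib


def count_collisions(tokens, n_buckets):
    """Number of distinct tokens that share a bucket with an earlier token."""
    occupied = {zlib.crc32(t.encode("utf-8")) % n_buckets for t in tokens}
    return len(tokens) - len(occupied)


def expected_collisions(n_tokens, n_buckets):
    """Expected collisions if each token picked a bucket uniformly at random."""
    expected_occupied = n_buckets * (1 - (1 - 1 / n_buckets) ** n_tokens)
    return n_tokens - expected_occupied


tokens = [f"word{i}" for i in range(10_000)]  # synthetic, distinct tokens
for n_buckets in (2 ** 10, 2 ** 20):
    observed = count_collisions(tokens, n_buckets)
    expected = expected_collisions(len(tokens), n_buckets)
    print(f"{n_buckets:>8} buckets: observed {observed}, expected ~{expected:.0f}")
```

With far fewer buckets than tokens, collisions are unavoidable. With `2 ** 20` buckets and 10,000 tokens, the uniform model predicts only a few dozen.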

### Aside: Hash Tables

Hash tables turn out to be ubiquitous once you start looking for them. The most familiar example is a dictionary (or a set) in Python. This mapping data structure, formally an associative array, is often called a hash in other languages. The great thing about hash tables is that lookup and insertion are (almost always) $\mathcal{O}(1)$.

In [None]:
SVG("https://upload.wikimedia.org/wikipedia/commons/7/7d/Hash_table_3_1_1_0_1_0_0_SP.svg")

In [None]:
dictionary = {
    ['mutable', 'object']: 123
}
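
The cell above raises a `TypeError`: dictionary keys must be hashable, and a list is mutable and therefore unhashable. A small sketch of which objects qualify (`is_hashable` is a hypothetical helper, not a built-in):

```python
def is_hashable(obj):
    """Return True if obj can be used as a dictionary key or set member."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


print(is_hashable(("immutable", "tuple")))  # tuples of hashable items work
print(is_hashable(["mutable", "list"]))     # lists do not
print(is_hashable(frozenset({"a", "b"})))   # frozensets work

# Converting the list to a tuple makes it usable as a key
dictionary = {tuple(["mutable", "object"]): 123}
print(dictionary[("mutable", "object")])
```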

How can we take advantage of hash tables for machine learning? Well, for one thing, you've been taking advantage of hash tables all along. This is precisely what the feature vectorizers we have been using are doing. Scikit-learn also provides a few convenient objects for taking advantage of hashing.

`FeatureHasher` applies a hash function to each feature name to find its column index directly, without building a vocabulary.

In [None]:
from sklearn.feature_extraction import FeatureHasher

hasher = FeatureHasher(input_type='string')

hasher.transform([
    ['a', 'simple', 'man'],
    ['pulp', 'fiction'],
    ['pride', '&', 'prejudice']
])
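
`FeatureHasher` also accepts dictionaries that map feature names to values. The sketch below uses made-up feature names and a tiny `n_features=8` (the default is `2 ** 20`) to show that the output width is set by `n_features` rather than by the data.

```python
from sklearn.feature_extraction import FeatureHasher

# n_features=8 is an illustrative choice; the default is 2 ** 20
hasher = FeatureHasher(n_features=8, input_type="dict")

X = hasher.transform([
    {"genre=drama": 1, "runtime": 102},
    {"genre=crime": 1, "runtime": 154},
])

print(X.shape)      # the width comes from n_features, not from the data
print(X.toarray())
```

Some entries may be negative: by default the hasher flips the sign of a value based on its hash (`alternate_sign=True`), which helps colliding features cancel out rather than pile up.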

`HashingVectorizer` is particularly convenient for working with documents.

In [None]:
from sklearn.feature_extraction.text import HashingVectorizer

hash_vect = HashingVectorizer(n_features=12, stop_words=['a', 'the'])

hash_vect.transform([
    'The Lion King (1994)',
    'Pulp Fiction (1994)',
    'A Simple Man (2013)',
    'Pride & Prejudice (2005)'
]).toarray()

In [None]:
hash_vect.transform(['(1994)']).toarray()
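
Because `HashingVectorizer` keeps no vocabulary, it needs no `fit` step and can handle words it has never seen. A short comparison with `CountVectorizer`, using made-up documents:

```python
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

train_docs = ["the lion king", "pulp fiction"]
new_docs = ["pride and prejudice"]

count_vect = CountVectorizer().fit(train_docs)
hash_vect = HashingVectorizer(n_features=12)

# CountVectorizer only knows the words it saw during fit,
# so a document made of unseen words becomes an all-zero row
counted = count_vect.transform(new_docs)
print(counted.toarray())

# HashingVectorizer hashes every token, so unseen words still get a column
hashed = hash_vect.transform(new_docs)
print(hashed.toarray())
```

The trade-off is that the hashed representation cannot be inverted: there is no way to look up which word a given column came from.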

## Out-of-core or Incremental Learning

As mentioned above, a number of estimators (but not all) can be fit without holding all of the data in memory: namely, any estimator that implements `partial_fit`. Choosing a mini-batch size that balances relevancy and memory footprint may take some tuning.
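
A minimal, runnable sketch of the pattern, assuming synthetic data: the hypothetical `batches` generator stands in for reading chunks from disk, and `SGDRegressor` is just one example of an estimator with `partial_fit`.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor


def batches(n_batches, batch_size, rng):
    """Yield synthetic (X, y) mini-batches; stands in for reading from disk."""
    true_coef = np.array([2.0, -1.0, 0.5])
    for _ in range(n_batches):
        X = rng.normal(size=(batch_size, 3))
        y = X @ true_coef + rng.normal(scale=0.1, size=batch_size)
        yield X, y


rng = np.random.RandomState(0)
model = SGDRegressor(random_state=0)

# Only one mini-batch is held in memory at a time
for X_batch, y_batch in batches(n_batches=50, batch_size=100, rng=rng):
    model.partial_fit(X_batch, y_batch)

print(model.coef_)
```

The learned coefficients should end up close to the `true_coef` used to generate the data, even though the model never saw more than 100 rows at once.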

Here is a list of incremental estimators for different tasks:

**Classification**

* sklearn.naive_bayes.MultinomialNB
* sklearn.naive_bayes.BernoulliNB
* sklearn.linear_model.Perceptron
* sklearn.linear_model.SGDClassifier
* sklearn.linear_model.PassiveAggressiveClassifier

**Regression**

* sklearn.linear_model.SGDRegressor
* sklearn.linear_model.PassiveAggressiveRegressor

**Clustering**

* sklearn.cluster.MiniBatchKMeans

**Decomposition / Feature Extraction**

* sklearn.decomposition.MiniBatchDictionaryLearning
* sklearn.decomposition.IncrementalPCA
* sklearn.decomposition.LatentDirichletAllocation
* sklearn.cluster.MiniBatchKMeans

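For classification, you need to tell the estimator in advance all of the classes you will be learning, because a given mini-batch may not contain every class. You do this by passing the `classes` argument on the first call to `partial_fit`.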
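
A small sketch of why this matters, using toy data invented for illustration. The first mini-batch contains only two of the three classes, so the full list is passed through the `classes` argument of `partial_fit`.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

all_classes = np.array([0, 1, 2])
clf = SGDClassifier(random_state=0)

# The first mini-batch happens to contain only classes 0 and 1
X_first = np.array([[0.0, 0.1], [1.0, 1.1], [0.1, 0.0], [1.1, 1.0]])
y_first = np.array([0, 1, 0, 1])
clf.partial_fit(X_first, y_first, classes=all_classes)

# A later batch introduces class 2; `classes` can be omitted from now on
X_later = np.array([[5.0, 5.1], [5.1, 5.0]])
y_later = np.array([2, 2])
clf.partial_fit(X_later, y_later)

print(clf.classes_)
print(clf.coef_.shape)  # one weight vector per class
```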

## Example

The following exercise is taken from Andreas Mueller's scikit-learn tutorial for the upcoming SciPy 2017 conference.

In [None]:
run fetch_imdb.py

## IMDb dataset

To illustrate the scalability issues of the vocabulary-based vectorizers, let's load a more realistic dataset for a classical text classification task: sentiment analysis on text documents. The goal is to tell apart negative from positive movie reviews from the Internet Movie Database (IMDb).

We're going to work with a large subset of movie reviews from the IMDb that has been collected by Maas et al.

* A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts. Learning Word Vectors for Sentiment Analysis. In the proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics.

This dataset contains 50,000 movie reviews, split into 25,000 training samples and 25,000 test samples. The reviews are labeled as either negative (neg) or positive (pos). Positive means that a movie received >6 stars on IMDb; negative means that it received <5 stars.

In [None]:
import os

train_path = os.path.join('..', 'data', 'IMDb', 'aclImdb', 'train')
train_pos = os.path.join(train_path, 'pos')
train_neg = os.path.join(train_path, 'neg')

fnames = [os.path.join(train_pos, f) for f in os.listdir(train_pos)] +\
         [os.path.join(train_neg, f) for f in os.listdir(train_neg)]

fnames[:3]

In [None]:
len(fnames)

The first 12,500 reviews are positive, so let's create our target.

In [None]:
import numpy as np

y_train = np.zeros((len(fnames), ), dtype=int)
y_train[:12500] = 1
np.bincount(y_train)

To train this model, we'll use the `SGDClassifier` with a logistic cost function. SGD stands for stochastic gradient descent, a workhorse method for out-of-core learning that updates the weight vector one sample at a time.
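
To make "one sample at a time" concrete, here is a bare-bones NumPy sketch of SGD for the logistic loss. It is an illustration under simplifying assumptions (fixed learning rate, no regularization, synthetic data), not how `SGDClassifier` is implemented, and `sgd_logistic_step` is a hypothetical helper.

```python
import numpy as np


def sgd_logistic_step(w, b, x, y, learning_rate=0.1):
    """One SGD update for the logistic loss on a single sample (y is 0 or 1)."""
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))  # predicted probability of class 1
    error = p - y                           # gradient of the loss w.r.t. the logit
    return w - learning_rate * error * x, b - learning_rate * error


rng = np.random.RandomState(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # synthetic, linearly separable labels

w, b = np.zeros(2), 0.0
for x_i, y_i in zip(X, y):  # a single pass, one sample at a time
    w, b = sgd_logistic_step(w, b, x_i, y_i)

accuracy = np.mean(((X @ w + b) > 0).astype(int) == y)
print(w, b, accuracy)
```

Each update touches only the current sample, which is why the full dataset never has to be in memory.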

Using the defaults below, we train the classifier on 25 mini-batches of 1,000 documents each, i.e. 25,000 documents in total, sampled at random with replacement.

In [None]:
from sklearn.base import clone

def batch_train(clf, fnames, labels, iterations=25, batchsize=1000, random_seed=1):
    
    vec = HashingVectorizer(encoding='latin-1')
    
    # create an index for all the reviews used below
    idx = np.arange(labels.shape[0])
    
    # performs a copy of our estimator
    c_clf = clone(clf)
    
    rng = np.random.RandomState(seed=random_seed)
    
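    for _ in range(iterations):
        # choose `batchsize` random documents (with replacement) to train on
        rnd_idx = rng.choice(idx, size=batchsize)
        
        documents = []
        # use a separate name so the inner loop does not shadow the outer one
        for doc_idx in rnd_idx:
            with open(fnames[doc_idx], 'r') as f:
                documents.append(f.read())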
                
        # get the labels for these documents
        batch_labels = labels[rnd_idx]
                
        # use our HashingVectorizer to transform the documents
        X_batch = vec.transform(documents)
        
        c_clf.partial_fit(X=X_batch, 
                          y=batch_labels, 
                          classes=[0, 1])
      
    return c_clf

In [None]:
from sklearn.linear_model import SGDClassifier

# the logistic loss is named 'log_loss' (it was 'log' before scikit-learn 1.1)
sgd = SGDClassifier(loss='log_loss', random_state=1)

sgd = batch_train(clf=sgd, 
                  fnames=fnames, 
                  labels=y_train)

We can evaluate how well we did on our holdout set.

In [None]:
from sklearn.datasets import load_files

test_path = os.path.join('..', 'data', 'IMDb', 'aclImdb', 'test')

test = load_files(container_path=(test_path),
                  categories=['pos', 'neg'])

docs_test, y_test = test['data'][12500:], test['target'][12500:]

In [None]:
vec = HashingVectorizer(encoding='latin-1')
sgd.score(vec.transform(docs_test), y_test)