### General dataset API

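* 3 kinds of interface
* 1) sample images = simplest interface (see Sample image loaders)
* 2) generators & svmlight loader = tuple (X, y); X = array(#samples, #features); y = array of targets (size = #samples)
* 3) toy, "real world" & mldata.org datasets = dictionary-like object holding at least `data` = array(#samples, #features) and `target` = array of targets (size = #samples)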
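
The difference between the dictionary-like and tuple return styles can be seen in a short sketch. It assumes only that scikit-learn is installed; the sample counts and `random_state` passed to `make_blobs` are arbitrary illustration values.

```python
from sklearn.datasets import load_iris, make_blobs

# Toy loaders return a dictionary-like object with at least "data" and "target".
iris = load_iris()
print(sorted(iris.keys()))
print(iris.data.shape, iris.target.shape)

# Generators return a plain (X, y) tuple instead.
X, y = make_blobs(n_samples=50, n_features=2, centers=3, random_state=0)
print(X.shape, y.shape)
```

The keys can be read either as `iris["data"]` or as `iris.data`.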

### Toy dataset loaders

* Too small to be considered real-world datasets

[boston](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html#sklearn.datasets.load_boston) | [iris](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html#sklearn.datasets.load_iris) | [diabetes](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_diabetes.html#sklearn.datasets.load_diabetes) | [digits](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html#sklearn.datasets.load_digits) | [linnerud](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_linnerud.html#sklearn.datasets.load_linnerud)


### Sample image loaders

* sample JPEG images, Creative Commons license

[all](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_sample_images.html#sklearn.datasets.load_sample_images) | [by_name](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_sample_image.html#sklearn.datasets.load_sample_image) | [demo](plot_color_quantization.ipynb)

### Random sample set generators

**Classification**

[make_blobs](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html#sklearn.datasets.make_blobs) - controls centers & std devs of clusters

[make_classification](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html#sklearn.datasets.make_classification) - introduces noise

[make_gaussian_quantiles](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_gaussian_quantiles.html#sklearn.datasets.make_gaussian_quantiles) - divides single Gaussian into classes separated by concentric hyperspheres

[make_hastie_10_2](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_hastie_10_2.html#sklearn.datasets.make_hastie_10_2) - generates data for binary classification

[make_circles](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_circles.html#sklearn.datasets.make_circles), [make_moons](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_moons.html#sklearn.datasets.make_moons) - generate 2D binary classification datasets
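
A minimal sketch of three of the classification generators listed above. The centers, spreads, sample counts and noise levels are arbitrary values chosen for illustration, not recommended settings.

```python
import numpy as np
from sklearn.datasets import make_blobs, make_classification, make_moons

# Three Gaussian clusters with explicit centers and per-cluster std devs.
centers = [(-5, 0), (0, 5), (5, 0)]
X_blobs, y_blobs = make_blobs(n_samples=300, centers=centers,
                              cluster_std=[0.5, 1.0, 1.5], random_state=0)

# flip_y randomly reassigns a fraction of the labels (one source of noise).
X_clf, y_clf = make_classification(n_samples=200, n_features=10,
                                   n_informative=3, n_redundant=2,
                                   flip_y=0.05, random_state=0)

# Two interleaving half circles in 2D; noise adds Gaussian jitter.
X_moons, y_moons = make_moons(n_samples=100, noise=0.1, random_state=0)

print(X_blobs.shape, np.unique(y_blobs))
print(X_clf.shape, np.unique(y_clf))
print(X_moons.shape, np.unique(y_moons))
```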

**Classification (Multilabel)**

[make_multilabel_classification](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_multilabel_classification.html#sklearn.datasets.make_multilabel_classification) | [demo](plot_random_multilabel_dataset.ipynb)
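
As a rough sketch of what the multilabel generator returns, assuming its default dense indicator output: `Y` has one column per class and a 1 wherever a sample carries that label. The sizes below are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification

# n_labels is the average number of labels per sample.
X, Y = make_multilabel_classification(n_samples=20, n_features=8,
                                      n_classes=4, n_labels=2,
                                      random_state=0)

print(X.shape)        # one row per sample, one column per feature
print(Y.shape)        # one row per sample, one column per class
print(Y[:5])          # binary indicator rows
print(Y.sum(axis=1))  # number of labels assigned to each sample
```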

**Biclustering**

[make_biclusters](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_biclusters.html#sklearn.datasets.make_biclusters) - generate array with constant block diagonal structure for biclustering

[make_checkerboard](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_checkerboard.html#sklearn.datasets.make_checkerboard) - generate array with block checkerboard structure for biclustering
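
Both biclustering generators return the data matrix together with boolean row and column membership indicators. The following is a small sketch with arbitrary shapes and noise levels:

```python
from sklearn.datasets import make_biclusters, make_checkerboard

# Block diagonal structure: 3 biclusters in a 30 x 20 matrix.
data, rows, cols = make_biclusters(shape=(30, 20), n_clusters=3,
                                   noise=0.5, random_state=0)
# rows[i, r] is True when matrix row r belongs to bicluster i.
print(data.shape, rows.shape, cols.shape)

# Checkerboard structure: 3 row clusters x 2 column clusters = 6 biclusters.
cb_data, cb_rows, cb_cols = make_checkerboard(shape=(30, 20),
                                              n_clusters=(3, 2),
                                              noise=0.5, random_state=0)
print(cb_data.shape, cb_rows.shape, cb_cols.shape)
```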

**Regression**

[make_regression](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html#sklearn.datasets.make_regression)

[make_sparse_uncorrelated](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_sparse_uncorrelated.html#sklearn.datasets.make_sparse_uncorrelated)

[make_friedman1](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_friedman1.html#sklearn.datasets.make_friedman1) - polynomial & sine transforms

[make_friedman2](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_friedman2.html#sklearn.datasets.make_friedman2) - feature multiplication & reciprocation

[make_friedman3](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_friedman3.html#sklearn.datasets.make_friedman3) - arctan transformation
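
A short sketch of two of the regression generators. `make_regression` can return the ground-truth coefficients, and with `noise=0.0` the Friedman #1 target can be recomputed from its documented formula (a sine term, a squared term and two linear terms on the first five features). All sizes here are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_friedman1, make_regression

# Linear problem where only 2 of the 5 features influence y.
X, y, coef = make_regression(n_samples=100, n_features=5, n_informative=2,
                             noise=1.0, coef=True, random_state=0)
print(np.flatnonzero(coef))  # indices of the informative features

# Friedman #1 without noise, so y can be rebuilt exactly from X.
X_f, y_f = make_friedman1(n_samples=100, n_features=10, noise=0.0,
                          random_state=0)
y_manual = (10 * np.sin(np.pi * X_f[:, 0] * X_f[:, 1])
            + 20 * (X_f[:, 2] - 0.5) ** 2
            + 10 * X_f[:, 3]
            + 5 * X_f[:, 4])
print(np.allclose(y_f, y_manual))
```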

**Manifolds**

[make_s_curve](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_s_curve.html#sklearn.datasets.make_s_curve)

[make_swiss_roll](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_swiss_roll.html#sklearn.datasets.make_swiss_roll)
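
Both manifold generators return 3D points plus the position of each point along the main dimension of the manifold, which is handy for colouring plots. A minimal sketch with arbitrary sample counts and noise:

```python
from sklearn.datasets import make_s_curve, make_swiss_roll

X_roll, t_roll = make_swiss_roll(n_samples=200, noise=0.05, random_state=0)
X_curve, t_curve = make_s_curve(n_samples=200, noise=0.05, random_state=0)

print(X_roll.shape, t_roll.shape)    # 3D coordinates, 1D manifold position
print(X_curve.shape, t_curve.shape)
```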

**Decomposition**

[make_low_rank_matrix](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_low_rank_matrix.html#sklearn.datasets.make_low_rank_matrix)

[make_sparse_coded_signal](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_sparse_coded_signal.html#sklearn.datasets.make_sparse_coded_signal)

[make_spd_matrix](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_spd_matrix.html#sklearn.datasets.make_spd_matrix)

[make_sparse_spd_matrix](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_sparse_spd_matrix.html#sklearn.datasets.make_sparse_spd_matrix)
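
A small sketch of two of the decomposition generators. It checks numerically that the matrix from `make_spd_matrix` is symmetric with positive eigenvalues, and inspects the singular values of a low-rank matrix. The dimensions are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_low_rank_matrix, make_spd_matrix

# Random symmetric positive-definite matrix of size 4 x 4.
A = make_spd_matrix(4, random_state=0)
print(np.allclose(A, A.T))           # symmetric
print(np.linalg.eigvalsh(A).min())   # smallest eigenvalue, expected > 0

# Mostly low-rank matrix: a few singular values carry most of the energy.
M = make_low_rank_matrix(n_samples=50, n_features=20, effective_rank=3,
                         random_state=0)
singular_values = np.linalg.svd(M, compute_uv=False)
print(singular_values[:6].round(3))
```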

### SVMlight / Libsvm format

* `<label> <feature#>:<value> <feature#>:<value> ...` -- one sample per line, space-separated pairs, zero-valued features omitted; suitable for sparse datasets

In [2]:
from sklearn.datasets import load_svmlight_file, load_svmlight_files
#X_train, y_train = load_svmlight_file(
#    "/path/to/train_dataset.txt")

# load 2+ datasets at once
#X_train, y_train, X_test, y_test = load_svmlight_files(
#    ("/path/to/train_dataset.txt",
#     "/path/to/test_dataset.txt"))

# fix #features
#X_test, y_test = load_svmlight_file(
#    "/path/to/test_dataset.txt", 
#    n_features=X_train.shape[1])
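
Since the paths above are placeholders, here is a self-contained round trip that can run without any files. It is a sketch that writes a tiny matrix to an in-memory buffer with `dump_svmlight_file` and reads it back. The buffer and the toy values are assumptions for illustration only.

```python
from io import BytesIO

import numpy as np
from sklearn.datasets import dump_svmlight_file, load_svmlight_file

X = np.array([[1.0, 0.0, 2.0],
              [0.0, 0.0, 3.0]])
y = np.array([1, 0])

# Write the data in svmlight format; zero-valued features are skipped.
buf = BytesIO()
dump_svmlight_file(X, y, buf, zero_based=True)
print(buf.getvalue().decode("ascii"))

# Read it back; n_features pins the width even if trailing columns are empty.
buf.seek(0)
X_loaded, y_loaded = load_svmlight_file(buf, n_features=3, zero_based=True)
print(type(X_loaded).__name__, X_loaded.shape)
```

The loader returns a scipy sparse matrix, so `X_loaded.toarray()` is needed to compare against the dense input.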

### External dataset loaders

* CSV, Excel, JSON, SQL -- [pandas.io](http://pandas.pydata.org/pandas-docs/stable/io.html)

* Binary formats, .mat, .arff, etc -- [scipy.io](http://docs.scipy.org/doc/scipy/reference/io.html)

* Columnar data => numpy arrays -- [numpy/routines.io](http://docs.scipy.org/doc/numpy/reference/routines.io.html)

* Directories of text files (dirname = category name, file = sample of category) -- [load_files](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_files.html)

* Images -- [skimage.io](http://scikit-image.org/docs/dev/api/skimage.io.html), [imageio](http://imageio.readthedocs.io/en/latest/userapi.html)

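* Images w/ pixel intensities -- [imread](http://docs.scipy.org/doc/scipy/reference/generated/scipy.misc.imread.html#scipy.misc.imread) -- requires [pillow](https://pypi.python.org/pypi/Pillow); `scipy.misc.imread` was removed in SciPy 1.2, so newer installs need one of the image libraries above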

* Audio (WAV) files -- [scipy](http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.io.wavfile.read.html)

* Category data stored as strings (common in Pandas) -- convert to one-hot variables using [OneHotEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder)

* Best practice: optimized file format such as HDF5.
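
The directory layout expected by `load_files` can be sketched with a throwaway folder. The category names (`pos`, `neg`), file names and texts below are hypothetical and exist only for this illustration.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files

# Hypothetical corpus: one sub-directory per category, one file per sample.
corpus = {"pos": ["great", "loved it"], "neg": ["awful"]}

with tempfile.TemporaryDirectory() as root:
    for category, texts in corpus.items():
        folder = Path(root) / category
        folder.mkdir()
        for i, text in enumerate(texts):
            (folder / f"doc{i}.txt").write_text(text, encoding="utf-8")

    # With an encoding, file contents are decoded to str instead of bytes.
    bunch = load_files(root, encoding="utf-8")

print(bunch.target_names)
for text, label in zip(bunch.data, bunch.target):
    print(bunch.target_names[label], "->", text)
```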

### Datasets

[Olivetti faces](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_olivetti_faces.html#sklearn.datasets.fetch_olivetti_faces) -- 10 images (64x64) for each of 40 subjects; quantized to 256 grey levels and stored as 8-bit integers; the loader converts them to floating point values in [0, 1]

In [3]:
from sklearn import datasets
faces = datasets.fetch_olivetti_faces()

### Newsgroups

* ~18,000 posts on 20 topics, split into 1 training subset and 1 testing subset (split is date-based)

[fetch_20newsgroups](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html#sklearn.datasets.fetch_20newsgroups) -- returns list of raw texts

[fetch_20newsgroups_vectorized](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups_vectorized.html#sklearn.datasets.fetch_20newsgroups_vectorized) -- returns ready-to-use features (no feature extractor needed)

In [4]:
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')

from pprint import pprint
pprint(list(newsgroups_train.target_names))

pprint(newsgroups_train.filenames.shape)
pprint(newsgroups_train.target.shape)
pprint(newsgroups_train.target[:10])

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']
(11314,)
(11314,)
array([ 7,  4,  4,  1, 14, 16, 13,  3,  2,  4])


In [5]:
#load subset of categories

cats = ['alt.atheism', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train', categories=cats)

list(newsgroups_train.target_names)

pprint(newsgroups_train.filenames.shape)
pprint(newsgroups_train.target.shape)
pprint(newsgroups_train.target[:10])

(1073,)
(1073,)
array([0, 1, 1, 1, 0, 1, 1, 0, 0, 0])


In [6]:
# convert text into TF-IDF vectors, using a subset of categories

from sklearn.feature_extraction.text import TfidfVectorizer
categories = ['alt.atheism', 'talk.religion.misc',
              'comp.graphics', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train',
                                      categories=categories)

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(newsgroups_train.data)
pprint(vectors.shape)

(2034, 34118)


In [7]:
# extracted vectors should be very sparse:
# average number of non-zero features per sample
vectors.nnz / float(vectors.shape[0])

159.0132743362832

In [8]:
# filtering text for more realistic training:
# first, a baseline classifier trained and scored on unfiltered text

from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
newsgroups_test = fetch_20newsgroups(subset='test',
                                     categories=categories)
vectors_test = vectorizer.transform(newsgroups_test.data)

clf = MultinomialNB(alpha=.01)
clf.fit(vectors, newsgroups_train.target)
pred = clf.predict(vectors_test)

pprint(metrics.f1_score(
        newsgroups_test.target, 
        pred, 
        average='macro'))

0.88213592402729568


In [9]:
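# show most informative features
# (written for older scikit-learn; newer releases replace
#  get_feature_names() with get_feature_names_out() and drop
#  MultinomialNB.coef_ in favour of feature_log_prob_)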

import numpy as np
def show_top10(classifier, vectorizer, categories):
    feature_names = np.asarray(vectorizer.get_feature_names())
    for i, category in enumerate(categories):
        top10 = np.argsort(classifier.coef_[i])[-10:]
        print("%s: %s" % (category, " ".join(feature_names[top10])))

show_top10(clf, vectorizer, newsgroups_train.target_names)

alt.atheism: edu it and in you that is of to the
comp.graphics: edu in graphics it is for and of to the
sci.space: edu it that is in and space to of the
talk.religion.misc: not it you in is that and to of the


In [10]:
# remove headers, signature blocks, quote blocks
# see how f-score goes down

newsgroups_test = fetch_20newsgroups(subset='test',
                                     remove=('headers', 'footers', 'quotes'),
                                     categories=categories)
vectors_test = vectorizer.transform(newsgroups_test.data)
pred = clf.predict(vectors_test)

pprint(metrics.f1_score(newsgroups_test.target,
                        pred,
                        average='macro'))


0.77310350681274775


In [11]:
# also strip metadata from the training set and refit -- f-score declines further

newsgroups_train = fetch_20newsgroups(subset='train',
                                      remove=('headers', 'footers', 'quotes'),
                                      categories=categories)
vectors = vectorizer.fit_transform(newsgroups_train.data)
clf = MultinomialNB(alpha=.01)
clf.fit(vectors, newsgroups_train.target)
vectors_test = vectorizer.transform(newsgroups_test.data)
pred = clf.predict(vectors_test)

pprint(metrics.f1_score(newsgroups_test.target, 
                        pred, 
                        average='macro'))

0.76995175184521725


### MLdata.org

In [17]:
# fetch MNIST digit recognition dataset
# (fetch_mldata was removed in scikit-learn 0.22; fetch_openml is its replacement)
# 70K samples, 28x28 pixels each, labeled with 0-9
# default data_home = ~/scikit_learn_data/

from sklearn.datasets import fetch_mldata
mnist = fetch_mldata('MNIST original', data_home="mnist.data")

pprint(mnist.data.shape)
pprint(mnist.target.shape)
pprint(np.unique(mnist.target))

(70000, 784)
(70000,)
array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9.])


In [19]:
#mldata usually shaped as (#features,#samples)
# opposite of scikit convention
# scikit transposes matrix by default

iris = fetch_mldata('iris', data_home="custom_data_home_test")
pprint(iris.data.shape)

iris = fetch_mldata('iris', transpose_data=False,
                    data_home="custom_data_home_test")
pprint(iris.data.shape)

(150, 4)
(4, 150)


### Faces in the wild

[fetch_lfw_people](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_lfw_people.html#sklearn.datasets.fetch_lfw_people)

[fetch_lfw_pairs](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_lfw_pairs.html#sklearn.datasets.fetch_lfw_pairs) - data divided into 3 subsets: development training (`train`), development testing (`test`) and evaluation (`10_folds`)

* auto-download, cache, parse metadata, decode jpeg, convert slices into memmapped numpy arrays
* dataset size >200MB
* [paper](http://vis-www.cs.umass.edu/lfw/lfw.pdf) | [demo](face_recognition.ipynb)

In [21]:
from sklearn.datasets import fetch_lfw_people
lfw_people = fetch_lfw_people(min_faces_per_person=70, resize=0.4)

for name in lfw_people.target_names:
    pprint(name)

'Ariel Sharon'
'Colin Powell'
'Donald Rumsfeld'
'George W Bush'
'Gerhard Schroeder'
'Hugo Chavez'
'Tony Blair'


In [23]:
# default slice
pprint(lfw_people.data.dtype)
pprint(lfw_people.data.shape)
pprint(lfw_people.images.shape)
pprint(lfw_people.target.shape)
pprint(lfw_people.target[:10])

dtype('float32')
(1288, 1850)
(1288, 50, 37)
(1288,)
array([5, 6, 3, 1, 0, 1, 3, 4, 3, 0])


In [24]:
# for face verification: 
# each sample = pair of pictures belonging (or not) to the same person
from sklearn.datasets import fetch_lfw_pairs
lfw_pairs_train = fetch_lfw_pairs(subset='train')

list(lfw_pairs_train.target_names)
pprint(lfw_pairs_train.pairs.shape)
pprint(lfw_pairs_train.data.shape)
pprint(lfw_pairs_train.target.shape)


(2200, 2, 62, 47)
(2200, 5828)
(2200,)


### [Forest covertypes](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_covtype.html#sklearn.datasets.fetch_covtype)

* 30m x 30m forest patches, 7 covertypes (tree species), 54 features/sample


In [27]:
from sklearn.datasets import fetch_covtype

covtypes = fetch_covtype()
pprint(covtypes)

{'DESCR': 'Forest covertype dataset.\n'
          '\n'
          'A classic dataset for classification benchmarks, featuring '
          'categorical and\n'
          'real-valued features.\n'
          '\n'
          'The dataset page is available from UCI Machine Learning Repository\n'
          '\n'
          '    http://archive.ics.uci.edu/ml/datasets/Covertype\n'
          '\n'
          'Courtesy of Jock A. Blackard and Colorado State University.\n',
 'data': array([[  2.59600000e+03,   5.10000000e+01,   3.00000000e+00, ...,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00],
       [  2.59000000e+03,   5.60000000e+01,   2.00000000e+00, ...,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00],
       [  2.80400000e+03,   1.39000000e+02,   9.00000000e+00, ...,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00],
       ..., 
       [  2.38600000e+03,   1.59000000e+02,   1.70000000e+01, ...,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00

### [RCV1](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_rcv1.html#sklearn.datasets.fetch_rcv1) -- Reuters corpus vol#1

* >800K Reuters stories (804,414 samples); compressed size ~656MB
* scipy CSR sparse matrix, 47,236 features
* first 23,149 samples = training set, remaining 781,265 = test set
* 0.16% of values are non-zero
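
To give a feel for what a 0.16% density means without downloading the corpus, this sketch builds a small random CSR matrix with the same density and compares its memory footprint with a dense array. The matrix size is an arbitrary stand-in, not the RCV1 shape.

```python
import scipy.sparse as sp

# Random CSR matrix with 0.16% non-zero entries (stand-in for RCV1).
X = sp.random(1000, 500, density=0.0016, format="csr", random_state=0)

# Density = stored non-zeros / total number of cells.
density = X.nnz / (X.shape[0] * X.shape[1])

# CSR stores only the values plus two index arrays.
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
dense_bytes = X.shape[0] * X.shape[1] * X.dtype.itemsize

print("density: %.4f%%" % (100 * density))
print("sparse: %d bytes, dense: %d bytes" % (sparse_bytes, dense_bytes))
```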

In [None]:
from sklearn.datasets import fetch_rcv1
rcv1 = fetch_rcv1()

pprint(rcv1.data.shape)
pprint(rcv1.target.shape)
pprint(rcv1.sample_id[:3])
pprint(rcv1.target_names[:3].tolist())

### [Boston house prices](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html#sklearn.datasets.load_boston)

* 506 samples, 13 attributes
* no missing attributes
* used for regression examples; the `load_boston` loader was removed in scikit-learn 1.2

### [Wisconsin breast cancer DB](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html#sklearn.datasets.load_breast_cancer)

* 569 samples, 30 attributes

### [Diabetes predictor](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_diabetes.html#sklearn.datasets.load_diabetes)

* 442 instances, 10 numeric predictor values + 1 disease progression measure (target)

### [Digits](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html#sklearn.datasets.load_digits)

* 1797 samples in the copy bundled with scikit-learn (the full UCI set has 5620), 64 attributes = 8x8 image of integer pixels in [0, 16]
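
The relationship between the 64 attributes and the 8x8 images can be checked directly. This sketch assumes the loader exposes both a flattened `data` array and an `images` array:

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()
print(digits.data.shape)    # (n_samples, 64): flattened pixels
print(digits.images.shape)  # (n_samples, 8, 8): same pixels as images

# Each row of data is the corresponding image flattened row by row.
flattened = digits.images.reshape(len(digits.images), -1)
print(np.array_equal(flattened, digits.data))
print(digits.data.min(), digits.data.max())  # pixel intensity range
```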

### [Iris Plants DB](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html#sklearn.datasets.load_iris)

* 150 samples, 4 attributes, no missing attributes

### [Linnerud](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_linnerud.html#sklearn.datasets.load_linnerud)

* 20 samples, no missing attributes
* 3 exercise attributes (chins, situps, jumps) + 3 physiological targets (weight, waist, pulse)
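
Linnerud is a multi-output dataset, so both `data` and `target` are 2D. A short sketch that prints which variables end up on which side:

```python
from sklearn.datasets import load_linnerud

linnerud = load_linnerud()
print(linnerud.data.shape, linnerud.target.shape)
print(list(linnerud.feature_names))  # exercise variables
print(list(linnerud.target_names))   # physiological variables
```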