## Task
Explore loading built-in and external datasets with scikit-learn.

## Notebook Summary
* Load built-in datasets
* Fetch external datasets

## References
* [scikit-learn Dataset loading utilities](http://scikit-learn.org/stable/datasets/index.html)


In [2]:
# show the value of every top-level expression in a cell, as the Python shell does
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import sklearn
print('sklearn.version =  ' + sklearn.__version__)
from sklearn import datasets as d


sklearn.version =  0.18.1


In [6]:
# ----------
# load built-in datasets
# ----------

boston = d.load_boston()
dir(boston)

boston.feature_names
X, y = boston.data, boston.target

print('X: type, shape:')
type(X), X.shape

print('Y: type, shape:')
type(y), y.shape

X[0:5,]


['DESCR', 'data', 'feature_names', 'target']

array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
       'TAX', 'PTRATIO', 'B', 'LSTAT'], 
      dtype='|S7')

X: type, shape:


(numpy.ndarray, (506, 13))

Y: type, shape:


(numpy.ndarray, (506,))

array([[  6.32000000e-03,   1.80000000e+01,   2.31000000e+00,
          0.00000000e+00,   5.38000000e-01,   6.57500000e+00,
          6.52000000e+01,   4.09000000e+00,   1.00000000e+00,
          2.96000000e+02,   1.53000000e+01,   3.96900000e+02,
          4.98000000e+00],
       [  2.73100000e-02,   0.00000000e+00,   7.07000000e+00,
          0.00000000e+00,   4.69000000e-01,   6.42100000e+00,
          7.89000000e+01,   4.96710000e+00,   2.00000000e+00,
          2.42000000e+02,   1.78000000e+01,   3.96900000e+02,
          9.14000000e+00],
       [  2.72900000e-02,   0.00000000e+00,   7.07000000e+00,
          0.00000000e+00,   4.69000000e-01,   7.18500000e+00,
          6.11000000e+01,   4.96710000e+00,   2.00000000e+00,
          2.42000000e+02,   1.78000000e+01,   3.92830000e+02,
          4.03000000e+00],
       [  3.23700000e-02,   0.00000000e+00,   2.18000000e+00,
          0.00000000e+00,   4.58000000e-01,   6.99800000e+00,
          4.58000000e+01,   6.06220000e+00,   3.000
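
The cell above was run on scikit-learn 0.18.1. `load_boston` was removed in scikit-learn 1.2, so on a current install the same inspection can only be sketched with a different bundled loader. The snippet below is a minimal sketch under two assumptions that are not part of the original notebook: it targets Python 3, and it uses `load_diabetes` in place of `load_boston`. `describe` is a hypothetical helper introduced only for illustration.

```python
from sklearn import datasets


def describe(bunch):
    """Return (type name, shape) pairs for the data and target arrays of a Bunch."""
    X, y = bunch.data, bunch.target
    return (type(X).__name__, X.shape), (type(y).__name__, y.shape)


# assumption: load_diabetes stands in for load_boston, which newer releases no longer ship
diabetes = datasets.load_diabetes()

x_info, y_info = describe(diabetes)
print('X: type, shape:', x_info)
print('y: type, shape:', y_info)

# same inspection steps as the notebook cell: feature names, then the first five rows
print(diabetes.feature_names)
print(diabetes.data[:5])
```

Built-in loaders such as this one read files shipped with the package, so they need no network access and return immediately.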

In [12]:
# ----------
# fetch external datasets
# ----------

# default data dir is ~/scikit_learn_data
d.get_data_home()
# caution: deletes every cached dataset, so the fetch below has to download again
d.clear_data_home()
print('-----')

# Fetch the Forest covertype dataset
%time covtype = d.fetch_covtype()
dir(covtype)
X, y = covtype.data, covtype.target

type(X), X.shape
type(y), y.shape

print(covtype.DESCR)

X[:5,]
y[:5,]


'/Users/niranjan/scikit_learn_data'

-----
CPU times: user 1min 20s, sys: 1.6 s, total: 1min 21s
Wall time: 1min 26s


['DESCR', 'data', 'target']

(numpy.ndarray, (581012, 54))

(numpy.ndarray, (581012,))

Forest covertype dataset.

A classic dataset for classification benchmarks, featuring categorical and
real-valued features.

The dataset page is available from UCI Machine Learning Repository

    http://archive.ics.uci.edu/ml/datasets/Covertype

Courtesy of Jock A. Blackard and Colorado State University.



array([[  2.59600000e+03,   5.10000000e+01,   3.00000000e+00,
          2.58000000e+02,   0.00000000e+00,   5.10000000e+02,
          2.21000000e+02,   2.32000000e+02,   1.48000000e+02,
          6.27900000e+03,   1.00000000e+00,   0.00000000e+00,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          1.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
        

array([5, 5, 2, 2, 5], dtype=int32)
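
Calling `clear_data_home()` with no arguments empties the default cache directory. Both `get_data_home` and `clear_data_home` accept a `data_home` argument, as does `fetch_covtype`, so a scratch cache can be used instead. The following is a minimal sketch assuming Python 3. The temporary directory, the placeholder file, and the `fetch_covtype_into` helper are hypothetical names that do not appear in the notebook. The helper is defined but not called here because it downloads data.

```python
import os
import tempfile

from sklearn import datasets

# hypothetical scratch location, used instead of ~/scikit_learn_data
scratch = os.path.join(tempfile.mkdtemp(), 'sk_cache')

# get_data_home creates the directory if it is missing and returns its path
cache_dir = datasets.get_data_home(data_home=scratch)


def fetch_covtype_into(data_home):
    """Download (or reuse) the covertype data inside data_home; needs network access."""
    return datasets.fetch_covtype(data_home=data_home)


# stand-in for a cached download, so the clearing step has something to remove
marker = os.path.join(cache_dir, 'placeholder.txt')
with open(marker, 'w') as fh:
    fh.write('cached data would live here')

# only the scratch cache is cleared; the default cache is left untouched
datasets.clear_data_home(data_home=cache_dir)
print(os.path.exists(marker))
```

With this layout, `covtype = fetch_covtype_into(cache_dir)` would store the download under the scratch directory, and clearing it would not affect other cached datasets.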