## Introduction
The objective of this notebook is to demonstrate the ``sklearn`` dataset API.

Recall that it provides three kinds of functions:
1. Loaders (``load_*``) load small standard datasets bundled with ``sklearn``.
2. Fetchers (``fetch_*``) download larger datasets from the internet and load them into memory.
3. Generators (``make_*``) generate controlled synthetic datasets.

By default, loaders and fetchers return a ``Bunch`` object, and generators return a tuple of a feature matrix and a label vector (or matrix).
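The difference in return types can be checked directly. The following is a minimal sketch; the variable names ``iris`` and ``generated`` and the small sample sizes are illustrative choices, not part of the original notebook.

```python
from sklearn.datasets import load_iris, make_regression
from sklearn.utils import Bunch

# A loader returns a Bunch by default.
iris = load_iris()
print(isinstance(iris, Bunch))

# A generator returns a plain tuple: (feature matrix, label vector).
generated = make_regression(n_samples=10, n_features=3, random_state=0)
print(type(generated), len(generated))
```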

## Loaders

### Loading iris dataset

In [34]:
from sklearn.datasets import load_iris
data = load_iris()

This returns a ``Bunch`` object ``data``, which is a dictionary-like object with the following attributes:
- ``data``, which holds the feature matrix
- ``target``, which holds the label vector
- ``feature_names``, which contains the names of the features
- ``target_names``, which contains the names of the classes
- ``DESCR``, which has the full description of the dataset
- ``filename``, which has the path to the location of the data
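Because a ``Bunch`` is dictionary-like, each attribute can also be read with key syntax. A small sketch of the two access styles:

```python
from sklearn.datasets import load_iris

data = load_iris()

# Attribute access and key access refer to the same underlying object.
print(data.feature_names is data["feature_names"])

# Like a dictionary, a Bunch can list its keys.
print(sorted(data.keys()))
```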

In [4]:
type(data)

sklearn.utils.Bunch

We can access them one by one and examine their contents. For example, we can access ``feature_names`` as follows

In [35]:
data.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

We can see the names of features of this dataset.

Let's examine the names of labels.

In [36]:
data.target_names

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

There are three classes ``setosa, versicolor, virginica``.

The feature matrix can be accessed as ``data.data``. Let's take a look at the first five examples in the data.

In [7]:
data.data[:5]

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2]])

We can observe 4 features per example.

Let's examine the shape of the feature matrix

In [8]:
data.data.shape

(150, 4)

There are 150 examples and each example has 4 features.

Finally we will examine the label vector and its shape.

In [9]:
data.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

There are 50 examples from each of the three classes: 0, 1 and 2.
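Rather than counting by eye, the class counts can be computed. A minimal sketch using ``np.bincount`` (NumPy is an added dependency here, not used elsewhere in the notebook):

```python
import numpy as np
from sklearn.datasets import load_iris

data = load_iris()

# np.bincount counts how many times each non-negative integer label occurs.
class_counts = np.bincount(data.target)
for name, count in zip(data.target_names, class_counts):
    print(name, count)
```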

We can read additional documentation about ``load_iris`` in the following manner

In [10]:
?load_iris

Signature: load_iris(*, return_X_y=False, as_frame=False)
Docstring:
Load and return the iris dataset (classification).

The iris dataset is a classic and very easy multi-class classification
dataset.

Classes                          3
Samples per class               50
Samples total                  150
Dimensionality                   4
Features            real, positive

Read more in the :ref:`User Guide <iris_dataset>`.

Parameters
----------
return_X_y : bool, default=False
    If True, returns ``(data, target)`` instead of a Bunch object. See
    below for more information about the `data` and `target` object.

    .. versionadded:: 0.18

as_frame : bool, default=False
    If True, the data is a pandas DataFrame including columns with
    appropriate dtypes (numeric). The target is
    a pandas DataFrame or Ser

In this way, we can load and examine different datasets.

We can obtain the feature matrix and the label (target) vector directly from ``load_iris``, and from other loaders in general, by setting the ``return_X_y`` argument to ``True``.

In [14]:
feature_matrix, label_vector = load_iris(return_X_y=True)
print('Shape of feature matrix:', feature_matrix.shape)
print('Shape of label vector:', label_vector.shape)
# We can also view the full details of the dataset with data.DESCR
data.DESCR

Shape of feature matrix: (150, 4)
Shape of label vector: (150,)




### Loading diabetes dataset

In [5]:
from sklearn.datasets import load_diabetes

Additional details about this loader can be accessed from the documentation.

In [6]:
?load_diabetes

Signature: load_diabetes(*, return_X_y=False, as_frame=False)
Docstring:
Load and return the diabetes dataset (regression).

Samples total    442
Dimensionality   10
Features         real, -.2 < x < .2
Targets          integer 25 - 346

.. note::
   The meaning of each feature (i.e. `feature_names`) might be unclear
   (especially for `ltg`) as the documentation of the original dataset is
   not explicit. We provide information that seems correct in regard with
   the scientific literature in this field of research.

Read more in the :ref:`User Guide <diabetes_dataset>`.

Parameters
----------
return_X_y : bool, default=False.
    If True, returns ``(data, target)`` instead of a Bunch object.
    See below for more information about the `data` and `target` object.

    .. versionadded:: 0.18

as_frame : bool, default=

In [7]:
# Call the loader and obtain the Bunch object
data = load_diabetes()
print(type(data))
print(data)

<class 'sklearn.utils.Bunch'>
{'data': array([[ 0.03807591,  0.05068012,  0.06169621, ..., -0.00259226,
         0.01990842, -0.01764613],
       [-0.00188202, -0.04464164, -0.05147406, ..., -0.03949338,
        -0.06832974, -0.09220405],
       [ 0.08529891,  0.05068012,  0.04445121, ..., -0.00259226,
         0.00286377, -0.02593034],
       ...,
       [ 0.04170844,  0.05068012, -0.01590626, ..., -0.01107952,
        -0.04687948,  0.01549073],
       [-0.04547248, -0.04464164,  0.03906215, ...,  0.02655962,
         0.04452837, -0.02593034],
       [-0.04547248, -0.04464164, -0.0730303 , ..., -0.03949338,
        -0.00421986,  0.00306441]]), 'target': array([151.,  75., 141., 206., 135.,  97., 138.,  63., 110., 310., 101.,
        69., 179., 185., 118., 171., 166., 144.,  97., 168.,  68.,  49.,
        68., 245., 184., 202., 137.,  85., 131., 283., 129.,  59., 341.,
        87.,  65., 102., 265., 276., 252.,  90., 100.,  55.,  61.,  92.,
       259.,  53., 190., 142.,  75., 142., 15

In [8]:
X,y = load_diabetes(as_frame=True,return_X_y=True)
print(type(X))
print(type(y))

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>


In [20]:
# Description of the dataset stored in the Bunch object
data.DESCR

'.. _diabetes_dataset:\n\nDiabetes dataset\n----------------\n\nTen baseline variables, age, sex, body mass index, average blood\npressure, and six blood serum measurements were obtained for each of n =\n442 diabetes patients, as well as the response of interest, a\nquantitative measure of disease progression one year after baseline.\n\n**Data Set Characteristics:**\n\n  :Number of Instances: 442\n\n  :Number of Attributes: First 10 columns are numeric predictive values\n\n  :Target: Column 11 is a quantitative measure of disease progression one year after baseline\n\n  :Attribute Information:\n      - age     age in years\n      - sex\n      - bmi     body mass index\n      - bp      average blood pressure\n      - s1      tc, total serum cholesterol\n      - s2      ldl, low-density lipoproteins\n      - s3      hdl, high-density lipoproteins\n      - s4      tch, total cholesterol / HDL\n      - s5      ltg, possibly log of serum triglycerides level\n      - s6      glu, blood sugar
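The output above is the raw string representation, so line breaks appear as ``\n``. As a small sketch, passing the text to ``print`` renders it in a readable form; showing only the first ten lines is an arbitrary choice made here.

```python
from sklearn.datasets import load_diabetes

data = load_diabetes()

# print() interprets the newline characters, so the description is readable.
description_lines = data.DESCR.splitlines()
print("\n".join(description_lines[:10]))
```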

In [21]:
# Shape of the feature matrix
data.data.shape

(442, 10)

In [22]:
# Shape of the target vector
data.target.shape

(442,)

In [23]:
# First five examples of the feature matrix
data.data[:5]

array([[ 0.03807591,  0.05068012,  0.06169621,  0.02187235, -0.0442235 ,
        -0.03482076, -0.04340085, -0.00259226,  0.01990842, -0.01764613],
       [-0.00188202, -0.04464164, -0.05147406, -0.02632783, -0.00844872,
        -0.01916334,  0.07441156, -0.03949338, -0.06832974, -0.09220405],
       [ 0.08529891,  0.05068012,  0.04445121, -0.00567061, -0.04559945,
        -0.03419447, -0.03235593, -0.00259226,  0.00286377, -0.02593034],
       [-0.08906294, -0.04464164, -0.01159501, -0.03665645,  0.01219057,
         0.02499059, -0.03603757,  0.03430886,  0.02269202, -0.00936191],
       [ 0.00538306, -0.04464164, -0.03638469,  0.02187235,  0.00393485,
         0.01559614,  0.00814208, -0.00259226, -0.03199144, -0.04664087]])

In [24]:
# First five labels
data.target[:5]

array([151.,  75., 141., 206., 135.])

In [26]:
# Names of the features
data.feature_names

['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
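The docstring shown earlier lists the feature range as ``-.2 < x < .2``. According to the scikit-learn description of this dataset, each column is mean-centered and scaled so that its sum of squares is 1. The following is a quick check under the assumption that the loader is called with its default settings; the tolerance ``1e-6`` is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)

# Every column should have (approximately) zero mean ...
column_means = X.mean(axis=0)
print(np.allclose(column_means, 0.0, atol=1e-6))

# ... and a sum of squares of (approximately) one.
column_sq_sums = (X ** 2).sum(axis=0)
print(np.allclose(column_sq_sums, 1.0, atol=1e-6))

# Smallest and largest feature values, to compare with the quoted range.
print(X.min(), X.max())
```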

## Fetchers

### fetch_openml
[openml.org](https://openml.org) is a public repository for machine learning data and experiments that allows anyone to upload open datasets.

Import the library and access the documentation

In [39]:
from sklearn.datasets import fetch_openml
?fetch_openml

Signature:
fetch_openml(
    name: Optional[str] = None,
    *,
    version: Union[str, int] = 'active',
    data_id: Optional[int] = None,
    data_home: Optional[str] = None,
    target_column: Union[str, List, NoneType] = 'default-target',
    cache: bool

Note that this is an experimental API and is likely to change in future releases.
> We use this API to load the MNIST dataset.

In [42]:
X, y = fetch_openml('mnist_784',version=1,return_X_y=True)
print('Feature matrix shape:', X.shape)
print('Label shape:', y.shape)

KeyboardInterrupt: 

## Generators
### ``make_regression``

In [43]:
from sklearn.datasets import make_regression
?make_regression

Signature:
make_regression(
    n_samples=100,
    n_features=100,
    *,
    n_informative=10,
    n_targets=1,
    bias=0.0,
    effective_rank=None,
    tail_strength=0.5,
    noise=0.0,
    shuffle=True,
    coef=False,
    random_state=None

Example 1

Let's generate 100 samples with 5 features for a single-target regression problem.

In [44]:
X, y = make_regression(n_samples=100, n_features=5, n_targets=1, shuffle=True,random_state=42)

It is good practice to set the seed (``random_state``) so that the experiments are repeatable.
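The effect of the seed can be verified with a short sketch: two calls with the same ``random_state`` give identical arrays, while a different seed gives different data. The names ``X1``, ``X2``, ``X3`` and the second seed value ``0`` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression

# Two calls with the same seed produce identical data.
X1, y1 = make_regression(n_samples=100, n_features=5, random_state=42)
X2, y2 = make_regression(n_samples=100, n_features=5, random_state=42)
same_seed_equal = np.array_equal(X1, X2) and np.array_equal(y1, y2)
print(same_seed_equal)

# A different seed produces different data.
X3, _ = make_regression(n_samples=100, n_features=5, random_state=0)
different_seed_equal = np.array_equal(X1, X3)
print(different_seed_equal)
```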

Let's look at the shapes of the feature matrix and label vector.

In [45]:
X.shape

(100, 5)

In [47]:
y.shape

(100,)
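The signature above also lists a ``coef`` parameter, along with the defaults ``noise=0.0`` and ``bias=0.0``. As a sketch under those defaults, requesting the coefficients lets us confirm that the targets are a linear function of the features.

```python
import numpy as np
from sklearn.datasets import make_regression

# With coef=True the generator also returns the coefficients of the linear model.
X, y, coef = make_regression(
    n_samples=100, n_features=5, n_targets=1, coef=True, random_state=42
)
print(coef.shape)

# With the default noise=0.0 and bias=0.0, y is reproduced by X @ coef.
print(np.allclose(X @ coef, y))
```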

Example 2

Let's generate 100 samples with 5 features for a multi-output regression problem with 5 outputs.

In [49]:
X, y = make_regression(n_samples=100, n_features=5, n_targets=5, shuffle=True,random_state=42)
print(X.shape)
print(y.shape)

(100, 5)
(100, 5)


### ``make_classification``

``make_classification`` generates a random n-class classification problem.

In [50]:
from sklearn.datasets import make_classification
?make_classification

Signature:
make_classification(
    n_samples=100,
    n_features=20,
    *,
    n_informative=2,
    n_redundant=2,
    n_repeated=0,
    n_classes=2,
    n_clusters_per_class=2,
    weights=None,
    flip_y=0.01,
    class_sep=1.0,
    hypercube=

Let's generate a binary classification problem with 10 features and 100 samples.

In [51]:
X, y = make_classification(n_samples=100, n_features=10, n_classes=2, n_clusters_per_class=1,random_state=42)
print(X.shape)
print(y.shape)

(100, 10)
(100,)


In [52]:
X[:5]

array([[ 0.11422765, -1.71016839, -0.06822216, -0.14928517,  0.30780177,
         0.15030176, -0.05694562, -0.22595246, -0.36361221, -0.13818757],
       [ 0.70775194, -1.57022472, -0.23503183, -0.63604713,  0.62180996,
        -0.56246678,  0.97255445, -0.77719676,  0.63240774, -0.47809669],
       [ 0.63859246,  0.04739867,  0.33273433,  1.1046981 , -0.65183611,
        -1.66152006, -1.2110162 ,  1.09821151, -0.0660798 ,  0.68024225],
       [-0.23894805, -0.97755524,  0.0379061 ,  0.19896733,  0.50091719,
        -0.90756366,  0.75539123,  0.12437227, -0.57677133,  0.07871283],
       [-0.59239392, -0.05023811,  0.17573204, -1.43949185,  0.27045683,
        -0.86399077, -0.83095012,  0.60046915,  0.04852163,  0.32557953]])

In [53]:
y[:5]

array([1, 1, 1, 1, 0])
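The ``weights`` and ``flip_y`` parameters in the signature above control the class proportions and the label noise. The following is a hedged sketch of an imbalanced binary problem; the 90/10 split and the names ``X_imb`` and ``y_imb`` are illustrative choices, not taken from the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification

# weights gives the approximate proportion of samples per class;
# flip_y=0 disables the random label flipping that is on by default.
X_imb, y_imb = make_classification(
    n_samples=100,
    n_features=10,
    n_classes=2,
    n_clusters_per_class=1,
    weights=[0.9, 0.1],
    flip_y=0,
    random_state=42,
)

# Count the samples that ended up in each class.
class_counts = np.bincount(y_imb, minlength=2)
print(class_counts)
```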

Let's generate a 3-class classification problem with 10 features and 100 samples.

In [54]:
X, y = make_classification(n_samples=100, n_features=10, n_classes=3, n_clusters_per_class=1,random_state=42)
print(X.shape)
print(y.shape)

(100, 10)
(100,)


In [57]:
print(X[:5])
print("y :",y[:5])

[[-0.58351628 -1.73833907 -1.37298251 -1.77311485  0.45918008  0.83392215
  -1.66096093  0.20768769 -0.07016571  0.42961822]
 [-1.0044394  -1.43862044  0.47335819 -0.21188291  0.0125924   0.22409248
  -0.77300978  0.49799829  0.0976761   0.02451017]
 [ 0.07740833  0.19896733  0.12437227  0.17738132 -0.97755524  0.50091719
   0.75138712  0.54336019  0.09933231 -1.66940528]
 [-0.91759569 -0.9609536   1.07746664  0.4522739  -0.32138584 -0.8254972
  -0.56372455  0.24368721  0.41293145 -0.8222204 ]
 [-0.96222828 -0.96090774  1.21530116  0.55980482 -1.24778318 -0.25256815
  -1.43014138  0.13074058  1.6324113  -0.44004449]]
y : [2 0 1 0 0]


### ``make_multilabel_classification``

This function helps us generate a random multi-label classification problem

In [58]:
from sklearn.datasets import make_multilabel_classification
?make_multilabel_classification

Signature:
make_multilabel_classification(
    n_samples=100,
    n_features=20,
    *,
    n_classes=5,
    n_labels=2,
    length=50,
    allow_unlabeled=True,
    sparse=False,
    return_indicator='dense',
    return_distributions=False,
    random_state=None,
)

Let's generate 100 samples with 10 features, 5 labels and on average 2 labels per example.

In [60]:
X, y = make_multilabel_classification(n_samples=100,n_features=10,n_classes=5,n_labels=2)
print(X.shape)
print(y.shape)

(100, 10)
(100, 5)
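Each row of ``y`` is a binary indicator vector, so summing across columns gives the number of labels attached to each example. A minimal sketch follows; ``random_state=42`` is added here for repeatability and is not in the original call.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification

X, y = make_multilabel_classification(
    n_samples=100, n_features=10, n_classes=5, n_labels=2, random_state=42
)

# Each row of y is a binary indicator vector with one column per class.
print(np.unique(y))

# Number of labels assigned to each example, and the average over the dataset.
labels_per_example = y.sum(axis=1)
print(labels_per_example[:10])
print(labels_per_example.mean())
```

The average is governed by ``n_labels`` but need not equal it exactly for a finite sample.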


### ``make_blobs``

``make_blobs`` enables us to generate random data for clustering

In [2]:
from sklearn.datasets import make_blobs
?make_blobs

Signature:
make_blobs(
    n_samples=100,
    n_features=2,
    *,
    centers=None,
    cluster_std=1.0,
    center_box=(-10.0, 10.0),
    shuffle=True,
    random_state=None,
    return_centers=False,
)
Docstring:
Generate isotropic Gaussian blobs for clustering.

Read more in the :ref:`User Gu

Let's generate a random dataset of 10 samples with 2 features each for clustering.

In [3]:
X,y = make_blobs(n_samples=10,centers=3,n_features=2,random_state=42)
print(X.shape)
print(y.shape)

(10, 2)
(10,)


We can find the cluster membership of each point in ``y``.

In [4]:
y

array([2, 2, 1, 2, 0, 0, 0, 1, 1, 0])
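The signature above includes a ``return_centers`` flag. As a sketch, setting it to ``True`` also returns the coordinates of the cluster centers, and ``np.bincount`` summarizes how many points fall in each cluster.

```python
import numpy as np
from sklearn.datasets import make_blobs

# return_centers=True additionally returns the coordinates of the cluster centers.
X, y, centers = make_blobs(
    n_samples=10, centers=3, n_features=2, random_state=42, return_centers=True
)
print(centers.shape)

# Number of points assigned to each of the three clusters.
cluster_sizes = np.bincount(y, minlength=3)
print(cluster_sizes)
```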