# Loading a Sample Dataset

In [1]:
import numpy as np

Here we load one of scikit-learn's popular preloaded datasets, the handwritten digits dataset:

In [2]:
from sklearn import datasets
digits = datasets.load_digits()
features = digits.data
target = digits.target

In [4]:
features[0]

array([ 0.,  0.,  5., 13.,  9.,  1.,  0.,  0.,  0.,  0., 13., 15., 10.,
       15.,  5.,  0.,  0.,  3., 15.,  2.,  0., 11.,  8.,  0.,  0.,  4.,
       12.,  0.,  0.,  8.,  8.,  0.,  0.,  5.,  8.,  0.,  0.,  9.,  8.,
        0.,  0.,  4., 11.,  0.,  1., 12.,  7.,  0.,  0.,  2., 14.,  5.,
       10., 12.,  0.,  0.,  0.,  0.,  6., 13., 10.,  0.,  0.,  0.])
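
Each row of the feature matrix is a flattened 8×8 grayscale image, which is why the observation above has 64 values. The following is a minimal sketch, assuming the default load_digits() arguments used above, that inspects the shapes and reshapes the first observation back into its pixel grid:

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()
features = digits.data
target = digits.target

# One row per image, one column per pixel (8 x 8 = 64 columns)
print(features.shape)
print(target.shape)

# Reshape the first observation back into its 8x8 pixel grid
first_image = features[0].reshape(8, 8)

# digits.images stores the same pixels in unflattened form
print(np.array_equal(first_image, digits.images[0]))
```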

scikit-learn comes with some common datasets we can quickly load.
These datasets are often called “toy” datasets because they are far smaller and cleaner
than a dataset we would see in the real world. Some popular sample datasets in scikit-learn
are:

- load_boston: Contains 506 observations on Boston housing prices. It is a good dataset for
  exploring regression algorithms. (This loader was deprecated in scikit-learn 1.0 and
  removed in 1.2.)
- load_iris: Contains 150 observations on the measurements of Iris flowers. It is a good dataset
  for exploring classification algorithms.
- load_digits: Contains 1,797 observations from images of handwritten digits. It is a good dataset
  for teaching image classification.
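
The other loaders follow the same pattern as load_digits. As a small sketch, load_iris can be called with return_X_y=True to get the feature matrix and target vector directly instead of the full dataset object:

```python
import numpy as np
from sklearn.datasets import load_iris

# return_X_y=True returns (features, target) instead of a Bunch object
features, target = load_iris(return_X_y=True)

# 150 observations with 4 measurements each
print(features.shape)

# The distinct class labels present in the target vector
print(np.unique(target))
```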

## Creating a Simulated Dataset

scikit-learn offers many methods for creating simulated data. Of those, three methods
are particularly useful.

When we want a dataset designed to be used with linear regression, make_regression is a
good choice:



In [6]:
from sklearn.datasets import make_regression

In [8]:
features, target, coefficients = make_regression(
    n_samples=100,
    n_features=3,
    n_informative=3,
    n_targets=1,
    bias=0.0,
    effective_rank=None,
    tail_strength=0.5,
    noise=0.0,
    shuffle=True,
    coef=True,
    random_state=1,
)

In [9]:
print('Feature Matrix\n', features[:3])
print('Target Vector\n', target[:3])

Feature Matrix
 [[ 1.29322588 -0.61736206 -0.11044703]
 [-2.793085    0.36633201  1.93752881]
 [ 0.80186103 -0.18656977  0.0465673 ]]
Target Vector
 [-10.37865986  25.5124503   19.67705609]
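
Because the call above sets coef=True, make_regression also returns the coefficients of the underlying linear model. The sketch below, which assumes the same noise=0.0 and bias=0.0 settings, checks that the target vector is exactly the linear combination of the features with those coefficients:

```python
import numpy as np
from sklearn.datasets import make_regression

features, target, coefficients = make_regression(
    n_samples=100,
    n_features=3,
    n_informative=3,
    n_targets=1,
    bias=0.0,
    noise=0.0,
    coef=True,
    random_state=1,
)

# One coefficient per feature
print(coefficients)

# With no noise and no bias, target should equal features @ coefficients
reconstructed = features @ coefficients
print(np.allclose(reconstructed, target))
```

Raising noise above zero adds Gaussian noise to the target, so this equality would then hold only approximately.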


If we are interested in creating a simulated dataset for classification, we can use
make_classification:

In [10]:
# Load library
from sklearn.datasets import make_classification
# Generate features matrix and target vector
features, target = make_classification(n_samples = 100,
                                       n_features = 3,
                                       n_informative = 3,
                                       n_redundant = 0,
                                       n_classes = 2,
                                       weights = [.25, .75],
                                       random_state = 1)
# View feature matrix and target vector
print('Feature Matrix\n', features[:3])
print('Target Vector\n', target[:3])

Feature Matrix
 [[ 1.06354768 -1.42632219  1.02163151]
 [ 0.23156977  1.49535261  0.33251578]
 [ 0.15972951  0.83533515 -0.40869554]]
Target Vector
 [1 0 0]
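
To see the effect of the weights argument used above, we can count how many observations fall into each class. This is a minimal sketch using np.bincount; the proportions are only approximately 25% and 75% because make_classification randomly reassigns a small fraction of labels by default (its flip_y parameter):

```python
import numpy as np
from sklearn.datasets import make_classification

features, target = make_classification(n_samples=100,
                                       n_features=3,
                                       n_informative=3,
                                       n_redundant=0,
                                       n_classes=2,
                                       weights=[.25, .75],
                                       random_state=1)

# Number of observations in class 0 and class 1
counts = np.bincount(target)
print(counts)

# Share of observations in each class
print(counts / counts.sum())
```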


Finally, if we want a dataset designed to work well with clustering techniques, scikit-learn
offers make_blobs:

In [11]:
# Load library
from sklearn.datasets import make_blobs
# Generate feature matrix and target vector
features, target = make_blobs(n_samples = 100,
                              n_features = 2,
                              centers = 3,
                              cluster_std = 0.5,
                              shuffle = True,
                              random_state = 1)
# View feature matrix and target vector
print('Feature Matrix\n', features[:3])
print('Target Vector\n', target[:3])

Feature Matrix
 [[ -1.22685609   3.25572052]
 [ -9.57463218  -4.38310652]
 [-10.71976941  -4.20558148]]
Target Vector
 [0 1 1]
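
The target vector returned by make_blobs records which cluster each observation was drawn from. As a rough sketch, we can group the observations by label to see the size and empirical center of each cluster:

```python
import numpy as np
from sklearn.datasets import make_blobs

features, target = make_blobs(n_samples=100,
                              n_features=2,
                              centers=3,
                              cluster_std=0.5,
                              shuffle=True,
                              random_state=1)

# For each cluster label, report its size and the mean of its members
for label in np.unique(target):
    members = features[target == label]
    print(label, len(members), members.mean(axis=0))
```

A smaller cluster_std keeps the points closer to these centers, which makes the clusters easier to separate.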


## Discussion

As might be apparent from the solutions, make_regression returns a feature matrix
of float values and a target vector of float values, while make_classification and
make_blobs return a feature matrix of float values and a target vector of integers representing
membership in a class.

scikit-learn’s simulated datasets offer extensive options to control the type of data
generated. scikit-learn’s documentation contains a full description of all the parameters,
but a few are worth noting.

In make_regression and make_classification, n_informative determines the
number of features that are used to generate the target vector. If n_informative is less
than the total number of features (n_features), the resulting dataset will have redundant
or uninformative features that can be identified through feature selection techniques.

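
As an illustration of n_informative, the sketch below generates five features of which only two are informative. The parameter values are chosen for this example only. With coef=True, the features that play no part in generating the target are expected to receive a coefficient of exactly zero:

```python
import numpy as np
from sklearn.datasets import make_regression

# Five features, but only two are used to build the target
features, target, coefficients = make_regression(n_samples=100,
                                                 n_features=5,
                                                 n_informative=2,
                                                 noise=0.0,
                                                 coef=True,
                                                 random_state=1)

# Coefficients of the underlying linear model, one per feature
print(coefficients)

# Number of features that actually contribute to the target
print(np.count_nonzero(coefficients))
```
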
In addition, make_classification contains a weights parameter that allows us to
simulate datasets with imbalanced classes. For example, weights = [.25, .75]
would return a dataset with roughly 25% of observations belonging to one class and 75% of
observations belonging to a second class.

For make_blobs, the centers parameter determines the number of clusters generated.
Using the matplotlib visualization library, we can visualize the clusters generated by
make_blobs: