# Loading External Datasets in Python

Thanks to the SciPy community, there are tons of resources out there for getting your hands on some data.

A particularly useful resource is the `sklearn.datasets` package of scikit-learn. It ships with several small datasets that do not require downloading any files from external websites. These datasets include:
- <tt>load_boston</tt>: The Boston dataset contains housing prices in different suburbs of Boston, along with a number of features such as per capita crime rate by town, proportion of residential land and non-retail business, etc. Note that this loader was removed in scikit-learn 1.2, so it is only available in older versions.
- <tt>load_iris</tt>: The Iris dataset contains three different types of iris flowers (setosa, versicolor, and virginica), along with four features describing the width and length of the sepals and petals.
- <tt>load_diabetes</tt>: The diabetes dataset lets you predict a quantitative measure of disease progression one year after baseline, based on features such as patient age, sex, body mass index, average blood pressure, and six blood serum measurements.
- <tt>load_digits</tt>: The digits dataset contains 8x8 pixel images of digits 0-9.
- <tt>load_linnerud</tt>: The Linnerud dataset contains three physiological and three exercise variables measured on twenty middle-aged men in a fitness club.
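
As a minimal sketch of how these bundled loaders are used, the following loads the Iris dataset and inspects its contents. The variable name `iris` is chosen here for illustration; `data`, `target`, and `target_names` are attributes of the object that `load_iris` returns.

```python
from sklearn import datasets

# Load the Iris dataset that ships with scikit-learn (no download needed)
iris = datasets.load_iris()

# Feature matrix: one row per flower, one column per measurement
print(iris.data.shape)

# Names of the three species encoded as integers in iris.target
print(list(iris.target_names))
```

The other `load_*` functions in the list follow the same pattern, so the same two attributes are a reasonable first thing to check for any of them.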

Even better, scikit-learn allows you to download datasets directly from external repositories, such as:
- <tt>fetch_olivetti_faces</tt>: The Olivetti faces dataset contains ten different images of each of 40 distinct subjects.
- <tt>fetch_20newsgroups</tt>: The 20 newsgroups dataset contains around 18,000 newsgroup posts on 20 topics.

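It used to be possible to download datasets directly from the machine learning database at http://mldata.org as well. That service has since been shut down, and current versions of scikit-learn provide `fetch_openml` for downloading from OpenML instead.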

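The cells below first install the required packages and then load scikit-learn's built-in digits dataset, a small collection of handwritten digits that resembles MNIST and ships with the library: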

In [5]:
!pip install numpy matplotlib scikit-learn pandas datasets



In [7]:
from sklearn import datasets

In [9]:
datasets.load_digits()

{'data': array([[ 0.,  0.,  5., ...,  0.,  0.,  0.],
        [ 0.,  0.,  0., ..., 10.,  0.,  0.],
        [ 0.,  0.,  0., ..., 16.,  9.,  0.],
        ...,
        [ 0.,  0.,  1., ...,  6.,  0.,  0.],
        [ 0.,  0.,  2., ..., 12.,  0.,  0.],
        [ 0.,  0., 10., ..., 12.,  1.,  0.]]),
 'target': array([0, 1, 2, ..., 8, 9, 8]),
 'frame': None,
 'feature_names': ['pixel_0_0',
  'pixel_0_1',
  'pixel_0_2',
  'pixel_0_3',
  'pixel_0_4',
  'pixel_0_5',
  'pixel_0_6',
  'pixel_0_7',
  'pixel_1_0',
  'pixel_1_1',
  'pixel_1_2',
  'pixel_1_3',
  'pixel_1_4',
  'pixel_1_5',
  'pixel_1_6',
  'pixel_1_7',
  'pixel_2_0',
  'pixel_2_1',
  'pixel_2_2',
  'pixel_2_3',
  'pixel_2_4',
  'pixel_2_5',
  'pixel_2_6',
  'pixel_2_7',
  'pixel_3_0',
  'pixel_3_1',
  'pixel_3_2',
  'pixel_3_3',
  'pixel_3_4',
  'pixel_3_5',
  'pixel_3_6',
  'pixel_3_7',
  'pixel_4_0',
  'pixel_4_1',
  'pixel_4_2',
  'pixel_4_3',
  'pixel_4_4',
  'pixel_4_5',
  'pixel_4_6',
  'pixel_4_7',
  'pixel_5_0',
  'pixel_5_1',
 

In [11]:
mnist = datasets.load_digits()  # the bundled 8x8 digits dataset, not the full MNIST database

Because the digits dataset ships with scikit-learn, this call reads from local files and does not need an internet connection.

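The full MNIST database contains a total of 70,000 examples of handwritten digits (28x28 pixel images, labeled from 0 to 9). The digits dataset loaded here is much smaller: 1,797 images of 8x8 pixels, with the same labels from 0 to 9. Data and labels are delivered in two separate containers, which we can inspect as follows: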

In [14]:
mnist.data.shape

(1797, 64)

In [16]:
mnist.target.shape

(1797,)

Here, we can see that `mnist.data` contains 1,797 images of 8 x 8 = 64 pixels each (the full MNIST database would instead give 70,000 images of 28 x 28 = 784 pixels each).
Labels are stored in `mnist.target`, where there is exactly one label per image.
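
To make the flattened layout concrete, here is a small sketch that turns one row of `data` back into a 2D image. It assumes the dataset returned by `load_digits`; the names `digits` and `first_image` are chosen for illustration.

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

# Each row of digits.data is a flattened image with 64 pixel values
first_image = digits.data[0].reshape(8, 8)
print(first_image.shape)

# The same images are also exposed in 2D form as digits.images
print(np.array_equal(first_image, digits.images[0]))
```

Keeping the data flattened is convenient for most scikit-learn estimators, which expect a 2D array of samples by features, while the reshaped form is what you would pass to a plotting function.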

We can further inspect the target values, but we don't want to print all of them.
Instead, we are interested in the distinct target values, which are easy to obtain with NumPy:

In [19]:
import numpy as np

In [21]:
np.unique(mnist.target)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
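
If you also want to know how often each label occurs, `np.unique` accepts a `return_counts=True` argument. The sketch below uses a small made-up label array (`labels` is hypothetical, not taken from the dataset) so that the result is easy to verify by hand.

```python
import numpy as np

# Hypothetical label array standing in for a target vector
labels = np.array([0, 1, 1, 2, 2, 2])

# Distinct values together with the number of times each one appears
values, counts = np.unique(labels, return_counts=True)

for value, count in zip(values, counts):
    print(value, count)
```

Applied to a real target vector, the counts give a quick check of whether the classes are roughly balanced.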

The full MNIST dataset can be loaded from the Hugging Face Hub with the `datasets` package:

In [23]:
!pip install datasets



In [24]:
from datasets import load_dataset

ds = load_dataset("ylecun/mnist")

In [26]:
ds

DatasetDict({
    train: Dataset({
        features: ['image', 'label'],
        num_rows: 60000
    })
    test: Dataset({
        features: ['image', 'label'],
        num_rows: 10000
    })
})

In [29]:
type(ds['train'][100]['image'])

PIL.PngImagePlugin.PngImageFile

In [30]:
import PIL

In [33]:
i = np.array(ds['train'][100]['image'])  # row-first indexing decodes just this one image

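Note that `shape` is an attribute of a NumPy array, not a method, so calling it with parentheses raises an error:

In [34]:
i.shape()

TypeError: 'tuple' object is not callable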

In [36]:
i.shape

(28, 28)
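
Many scikit-learn estimators expect each sample as a flat feature vector, so a 28 x 28 image like this one is commonly flattened to 784 values and rescaled. The sketch below illustrates that step under stated assumptions: it uses a randomly generated stand-in array (`image` is hypothetical data, not an actual MNIST digit) so that it runs without downloading anything.

```python
import numpy as np

# Stand-in for one 28 x 28 grayscale image with integer values in 0..255
# (hypothetical data, generated here so the sketch needs no download)
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(28, 28), dtype=np.uint8)

# Flatten to a 1D feature vector and rescale pixel values to [0, 1]
features = image.reshape(-1).astype(np.float32) / 255.0

print(features.shape)
print(features.min() >= 0.0, features.max() <= 1.0)
```

With a real image, `image` would simply be the array obtained from `np.array(...)` as in the cell above.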

In [38]:
i

array([[  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0],
       [  