Machine Learning is about building programs with tunable parameters (typically an array of floating point values) that are adjusted automatically so as to improve their behavior by adapting to previously seen data.

In [5]:
%matplotlib inline
import numpy as np
X = np.random.random((100, 4))
X.shape

(100, 4)

X holds 100 samples (instances), each described by 4 features, so its shape is (n_samples, n_features).

The number of features must be fixed in advance. However, the data can be very high-dimensional (e.g. millions of features), with most features being zero for any given sample. In that case we can use scipy.sparse matrices instead of numpy arrays so that the data fits in memory.
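To make the memory argument concrete, here is a minimal sketch comparing the storage needed by a dense array and by a `scipy.sparse` CSR matrix. The sizes and the 0.1% density are illustrative assumptions, not values from a real dataset.

```python
import numpy as np
from scipy import sparse

# Illustrative sizes (assumed): 1,000 samples, 10,000 features, 0.1% nonzero.
n_samples, n_features = 1000, 10_000

# Random sparse matrix in CSR format; only nonzero entries are stored.
X_sparse = sparse.random(
    n_samples, n_features, density=0.001, format="csr", random_state=0
)

# Bytes a dense float64 array of the same shape would need.
dense_bytes = n_samples * n_features * np.dtype(np.float64).itemsize

# Bytes used by the three arrays that back a CSR matrix.
sparse_bytes = (
    X_sparse.data.nbytes + X_sparse.indices.nbytes + X_sparse.indptr.nbytes
)

print(X_sparse.shape, dense_bytes, sparse_bytes)
```

The sparse matrix keeps the same (n_samples, n_features) shape, so it can stand in for a dense array in estimators that accept sparse input.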

In [6]:
from sklearn.datasets import load_iris
iris = load_iris()

In [7]:
X = iris.data
X.shape

(150, 4)

The iris dataset has 150 samples with 4 features each.

In [8]:
Y = iris.target
Y.shape

(150,)

Y holds the target labels, one per sample, so it has 150 entries.

In [13]:
# Number of labels = Number of samples
Y.shape[0] == X.shape[0]

True

In [16]:
len(iris.target) == len(X)

True

In [17]:
iris.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [20]:
# iris.target_names is a numpy array of label names
list(iris.target_names)

['setosa', 'versicolor', 'virginica']

In [22]:
X[:3]

array([[ 5.1,  3.5,  1.4,  0.2],
       [ 4.9,  3. ,  1.4,  0.2],
       [ 4.7,  3.2,  1.3,  0.2]])

A simple way to turn a **categorical feature** into numerical features suitable for machine learning is to create one new feature per distinct category, valued 1.0 if the sample belongs to that category and 0.0 otherwise (often called one-hot encoding). For example, suppose each iris sample also had a color among purple, blue, and red.

The enriched iris feature set would then be:

sepal length in cm
sepal width in cm
petal length in cm
petal width in cm
color#purple (1.0 or 0.0)
color#blue (1.0 or 0.0)
color#red (1.0 or 0.0)
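The encoding described above can be sketched with plain numpy. The three rows are the first samples shown by `X[:3]` earlier; the `colors` values are hypothetical, since the iris dataset has no color feature.

```python
import numpy as np

# First three iris samples, as displayed by X[:3] above.
X_head = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [4.7, 3.2, 1.3, 0.2],
])

# Hypothetical color per sample (not part of the real dataset).
colors = np.array(["purple", "blue", "purple"])

# Fixed category order: color#purple, color#blue, color#red.
categories = np.array(["purple", "blue", "red"])

# Compare every sample's color against every category:
# 1.0 where they match, 0.0 elsewhere.
one_hot = (colors[:, None] == categories[None, :]).astype(float)

# Append the three indicator columns to the four numeric features.
X_enriched = np.hstack([X_head, one_hot])
print(X_enriched.shape)  # (3, 7)
```

Each row of `one_hot` contains exactly one 1.0, so the enriched matrix has 4 + 3 = 7 columns.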

Extracting features from unstructured data
http://www.astroml.org/sklearn_tutorial/general_concepts.html

Supervised learning: classification (discrete labels) and regression (continuous labels).
Unsupervised learning: exploratory tasks such as dimensionality reduction, clustering, and density estimation, for example visualizing the four-dimensional iris dataset in two dimensions.
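As a rough illustration of both settings on the iris data loaded earlier, the sketch below fits a classifier on the labels (supervised) and projects the four features onto two dimensions without using the labels (unsupervised). The choice of `KNeighborsClassifier` and `PCA`, the 75/25 split, and `random_state=0` are assumptions made for this example only.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, Y = iris.data, iris.target

# Supervised: learn a mapping from features to discrete labels.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, random_state=0, stratify=Y
)
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, Y_train)
accuracy = clf.score(X_test, Y_test)  # fraction of correct test predictions

# Unsupervised: reduce 4 features to 2 without looking at Y.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(accuracy, X_2d.shape)
```

The two columns of `X_2d` can then be passed to a scatter plot to view the dataset in two dimensions.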
