In [38]:
from sklearn.datasets import load_iris
iris = load_iris()

First we import the Iris dataset that ships with scikit-learn.

In [39]:
# Inputs (data): X
X = iris.data
# Outputs (target): y
y = iris.target
# Each column in X is a "feature"
feature_names = iris.feature_names
target_names = iris.target_names

Notice that this is supervised learning: every sample in `X` comes with a label in `y`, so we already know the correct answer for each row. Our goal is to have the machine learn a function (the model) that maps the inputs to the desired outputs.
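To make the inputs and labels concrete, here is a minimal sketch that inspects the same dataset. The variable `first_label` is introduced only for illustration.

```python
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# Each row of X is one flower; each column is one measured feature.
print(X.shape)             # (150, 4)
print(iris.feature_names)

# y holds one integer label per row of X.
print(y.shape)             # (150,)

# target_names maps each integer label to a species name.
first_label = iris.target_names[y[0]]
print(first_label)         # setosa
```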

In [40]:
type(X)

numpy.ndarray

In [41]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)

print(X_train.shape)
print(X_test.shape)

(90, 4)
(60, 4)


We split our data into a training set and a test set.

`train_test_split` splits arrays or matrices into random train and test subsets. `.shape` returns the dimensions of an array as (rows, columns). The number of rows in each subset changes with `test_size`, which makes it easy to try different train and test sizes. Because no `random_state` is set here, the split (and so the accuracy measured later) changes from run to run. Generally we want more training data than test data, since more training examples tend to produce a better model.

But if `test_size` is very small (and the training set correspondingly large), we can't be confident in the measured accuracy, because the model is only tested on a few samples. So there is a trade-off between training and test data, and having more data overall eases it. This is one area where our approach could be improved.
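The effect of `test_size` on the two subsets can be sketched as follows. The fixed `random_state=0` and the `split_sizes` dictionary are assumptions added here so the example is repeatable; the notebook itself uses a random split.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Record (train rows, test rows) for a few candidate test sizes.
split_sizes = {}
for test_size in (0.1, 0.2, 0.4):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=0  # seed is an assumption
    )
    split_sizes[test_size] = (X_train.shape[0], X_test.shape[0])
    print(test_size, X_train.shape, X_test.shape)
```

With `test_size=0.4` this gives the same 90/60 split as the cell above; smaller values move rows from the test set to the training set.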

In [42]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

We build a k-nearest neighbors model, train it on the training set, and predict labels for the test set.

We gave the model only `X_test`, but we already hold the correct answers for those rows in `y_test`. So now we can compare the model's predictions with the answers.

In [43]:
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred))

0.9666666666666667


We check our model's output and accuracy. In this run the model labels about 97% of the test samples correctly (58 of 60).
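Accuracy is simply the fraction of predictions that match the true labels, so it can be reproduced by hand. The sketch below assumes a fixed `random_state=0` for repeatability, so its numbers will not necessarily match the run above.

```python
import numpy as np
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0  # seed is an assumption
)
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Element-wise comparison: True where the prediction equals the answer.
matches = y_pred == y_test
manual_accuracy = matches.mean()
library_accuracy = metrics.accuracy_score(y_test, y_pred)

print(matches.sum(), "of", len(y_test), "correct")
print(manual_accuracy, library_accuracy)
```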

If we could collect more data, additional features (columns) would give the machine more information, which may help build a better model.

We could also try other classification algorithms.
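As a rough sketch of trying other algorithms, the loop below fits three scikit-learn classifiers on the same split and records their test accuracy. The choice of `LogisticRegression` and `DecisionTreeClassifier`, their parameters, and the fixed seed are assumptions made for illustration, not part of the original notebook.

```python
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0  # seed is an assumption
)

# Every scikit-learn classifier exposes the same fit/predict interface.
candidates = {
    "knn": KNeighborsClassifier(n_neighbors=3),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = metrics.accuracy_score(y_test, model.predict(X_test))
    print(name, scores[name])
```

Because the interface is shared, swapping algorithms only changes the line that constructs the model.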

In [45]:
from joblib import dump, load
dump(knn, "mlbrain.joblib")

In [46]:
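# Load the saved model back from disk; no retraining is needed
model = load("mlbrain.joblib")
model.predict(X_test)
# Two made-up flowers: [sepal length, sepal width, petal length, petal width] in cm
sample = [[3, 5, 4, 2], [2, 3, 5, 4]]
predictions = model.predict(sample)
# Map the integer labels back to species names
pred_species = [iris.target_names[p] for p in predictions]
print(pred_species)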

array([2, 0, 2, 2, 1, 2, 0, 2, 0, 2, 1, 1, 1, 2, 2, 1, 2, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 2, 1, 1, 1, 2, 1, 1, 0, 0, 1, 2, 2, 2, 0, 0, 0, 0, 0,
       1, 1, 1, 0, 0, 2, 2, 1, 0, 2, 2, 2, 2, 2, 1, 0])

**Model persistence** allows us to train a model once and **save it**, so that we don't have to retrain it every time we want to use it, for example in production. We can use `joblib` to save (dump) the model to a file and load it again later. Whenever we improve the model, we can just dump it again to overwrite the saved copy.
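A quick way to convince ourselves that the saved model is equivalent to the original is a save/load round trip. This sketch writes to a temporary directory (an assumption, to avoid leaving files behind) and trains on the full dataset for brevity.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "mlbrain.joblib")
    dump(knn, path)        # save the trained model to disk
    restored = load(path)  # read it back into a new object

# The restored model should predict exactly what the original predicts.
same_predictions = np.array_equal(knn.predict(X), restored.predict(X))
print(same_predictions)
```

Note that `joblib` files are pickle-based, so only load model files from sources you trust.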

There are many tools for machine learning. Good results generally depend on having large, representative data sets.

We can create **custom models**, for example with TensorFlow and a Machine Learning Engine.

Companies provide **retrainable models** such as AutoML Vision, AutoML Natural Language, and AutoML Translation.

Companies also provide **pre-trained models** such as the Vision API, Speech API, Jobs API, Natural Language API, Translation API, and Video Intelligence API.

The value is not necessarily in the model, but in the data that we use to train it.