# Agenda

1. What is machine learning?
2. What is `sklearn`, and how does it fit into this world?
3. Classification (iris)
4. Building a model
5. Fitting and predicting -- what do these mean?
6. Testing our model using split-testing
7. Testing our model in more sophisticated ways
8. Building other models and testing them -- and comparing them

# What is data science?

I personally say that data science has three big parts:

1. Data engineering -- getting the data from its original locations into a format that you can use in a serious, practical way.
2. Data analytics -- take data that describes the past, and understand that past -- how many people bought my courses? How many people used my Web site? How many people, on Sunday at 12 noon, are buying my product? How many people, at a given time of day, are trying (unsuccessfully) to log into my system?
3. Machine learning -- let's take data that we already have and use it to make predictions about the future. Machine learning has many, many uses and applications.

# Types of machine learning

- Supervised learning
    1. Classification -- given an item and several categories, how would we categorize this item?
        - Spam or not spam?
        - Insurance -- accept or reject?
        - Credit-card applications and also purchases
    2. Regression -- given some data, what number would we associate with it?
        - How much do we think a certain stock will be worth?
        - Predict scores?
- Unsupervised learning
    3. Clustering (Automatic classification)
    4. Dimensionality reduction (Parameter simplification)

# What is sklearn?

Python's "scipy stack" includes a bunch of different packages:

- NumPy, which provides us with fast, efficient numeric calculations
- Pandas, which builds on NumPy and gives us easier-to-use, labeled data structures (series and data frames)
- SciPy, which provides us with a bunch of useful functionality for statistics, etc.
- SciPy add-on packages are known as "scikits," and one of those is scikit-learn (imported as `sklearn`), for machine learning.

You can install those with `pip`:

    pip install -U numpy pandas scikit-learn matplotlib
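
To check that the installation worked, here is a minimal sketch that asks Python for the installed version of each package. One thing to notice: the package is named `scikit-learn` on PyPI, but you import it as `sklearn`. The variable names (`packages`, `installed`) are my own, and a missing package is recorded as `None` rather than raising an error.

```python
from importlib.metadata import PackageNotFoundError, version

# PyPI distribution names; note "scikit-learn" here, even though we "import sklearn"
packages = ["numpy", "pandas", "scikit-learn", "matplotlib"]

installed = {}
for name in packages:
    try:
        installed[name] = version(name)   # version string of the installed package
    except PackageNotFoundError:
        installed[name] = None            # the package isn't installed

for name, found in installed.items():
    print(f"{name}: {found or 'not installed'}")
```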

# What will we do?

- We'll get some data
- We'll teach the computer which data goes into which category (supervised learning)
- Then we'll show some new data to the computer
- It'll put that data into the right category
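
Those four steps can be sketched in a few lines. This is only a preview, under some assumptions: I'm using `KNeighborsClassifier` as a stand-in algorithm (nothing above chooses one), and the measurements in `new_flower` are made up for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# 1. Get some data
iris = load_iris()
X, y = iris.data, iris.target            # measurements and their known categories

# 2. Teach the computer which data goes into which category
model = KNeighborsClassifier(n_neighbors=3)   # assumed algorithm, for illustration only
model.fit(X, y)

# 3. Show some new data to the computer (hypothetical measurements, in cm)
new_flower = [[5.0, 3.4, 1.5, 0.2]]

# 4. Ask which category it belongs to
predicted = model.predict(new_flower)
print(iris.target_names[predicted])
```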

# Iris dataset

The most famous dataset in data science is the "iris" data set. It contains measurements of 150 individual irises (purple flowers), 50 from each of three species. Each flower has been measured in four different ways:

- Petal length + width 
- Sepal length + width

Based on these four measurements, can we predict which of three types of irises we have?
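
One quick way to see why this might be possible is to average each measurement per type of iris. The sketch below assumes a version of scikit-learn that supports `load_iris(as_frame=True)`, which returns Pandas objects; the variable name `means` is my own.

```python
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)   # as_frame=True gives us Pandas objects

# Average each of the four measurements separately for each type of iris
means = iris.frame.groupby("target").mean()

# Groups are sorted by target number (0, 1, 2), the same order as target_names
means.index = list(iris.target_names)

print(means.round(2))
```

If the averages differ a lot from one type to the next, then the measurements carry information about the type.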

# Terminology

We're going to create a model. That model will be trained with our iris data. Then we'll be able to ask the model for a prediction.

- Inputs to the model can be called X, independent variables, inputs.
- Outputs from the model can be called y, dependent variables, outputs, target.

We're going to take some iris data, and train a model with it.
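
In code, the convention is a capital `X` for the inputs (a two-dimensional table) and a lowercase `y` for the outputs (one value per row). A minimal sketch, using the `return_X_y=True` option of `load_iris` to get both at once:

```python
from sklearn.datasets import load_iris

# return_X_y=True returns the inputs and outputs as a tuple, instead of a Bunch
X, y = load_iris(return_X_y=True)

print(X.shape)   # rows are flowers, columns are measurements
print(y.shape)   # one category number per flower
```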

In [1]:
# sklearn comes with a bunch of sample data sets

from sklearn.datasets import load_iris   # this imports a function that will then load the data set

iris = load_iris()    # now I've loaded the data set into the "iris" variable

In [3]:
# what does it contain?

type(iris)   # this is a Bunch object, which sklearn uses for its sample data sets. It's basically a dictionary whose keys are also available as attributes

sklearn.utils._bunch.Bunch
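
A `Bunch` is a subclass of `dict`, so the same values can be reached either as keys or as attributes. A small sketch to confirm that (the variable names are my own):

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Key access and attribute access return the very same object
same_object = iris["data"] is iris.data
print(same_object)

# Because it's a dict, the usual dict methods work too
keys = sorted(iris.keys())
print(keys)
```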

In [4]:
dir(iris)   # what attributes are in this bunch?

['DESCR',
 'data',
 'data_module',
 'feature_names',
 'filename',
 'frame',
 'target',
 'target_names']

In [5]:
# let's look at the description of this data set

print(iris.DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
    - sepal length in cm
    - sepal width in cm
    - petal length in cm
    - petal width in cm
    - class:
            - Iris-Setosa
            - Iris-Versicolour
            - Iris-Virginica

:Summary Statistics:

                Min  Max   Mean    SD   Class Correlation
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fis

In [6]:
iris.data  # NumPy array of 150x4

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1. , 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [5.2, 3.4, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.4, 3.4, 1.5, 0.4],
       [5.2, 4.1, 1.5, 0.1],
       [5.5, 4.2, 1.4, 0.2],
       [4.9, 3

In [7]:
iris.data_module   # what is the (string) name of the module from which we read it?

'sklearn.datasets.data'

In [8]:
iris.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [9]:
# where are the outputs? Those are on iris.target
iris.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [10]:
iris.target_names  # here are the names for these three classification numbers

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
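
Since `iris.target` holds integers and `iris.target_names` is a NumPy array, we can translate every number into its name with one indexing operation. A minimal sketch; `species` is a name I made up:

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Use the array of target numbers as an index into the array of names
species = iris.target_names[iris.target]

print(species[:3])    # the first three flowers
print(species[-3:])   # the last three flowers
```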

In [11]:
dir(iris)

['DESCR',
 'data',
 'data_module',
 'feature_names',
 'filename',
 'frame',
 'target',
 'target_names']

In [12]:
iris.frame

In [13]:
# Modern versions of sklearn will produce (not just work with) Pandas data frames

iris = load_iris(as_frame=True)

In [14]:
iris.data

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2
...,...,...,...,...
145,6.7,3.0,5.2,2.3
146,6.3,2.5,5.0,1.9
147,6.5,3.0,5.2,2.0
148,6.2,3.4,5.4,2.3


In [16]:
iris.target_names

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

In [17]:
# the "frame" attribute contains a data frame with *all* of the columns -- inputs and outputs
iris.frame

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,2
146,6.3,2.5,5.0,1.9,2
147,6.5,3.0,5.2,2.0,2
148,6.2,3.4,5.4,2.3,2
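
If all you have is the combined data frame, you can split it back into inputs and outputs yourself. A sketch, assuming the output column is named `target`, as it is in the output above:

```python
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame

X = df.drop(columns="target")   # everything except the output column
y = df["target"]                # just the output column

print(X.shape, y.shape)
```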


# Let's create a model!

People talk nonstop about "algorithms." An algorithm is a recipe for what code should execute, an extended formula or set of formulas. Choosing an algorithm is certainly important, because each one figures out the connections between the inputs and the outputs in a different way.
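
To make "different algorithms" concrete: in scikit-learn, each algorithm is a class. The sketch below gives the same data to two of them and asks each about the same flower. The two classes are my own picks for illustration, and the measurements in `flower` are made up; the two may or may not agree.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Two different algorithms (assumed choices, for illustration only)
algorithms = [
    KNeighborsClassifier(n_neighbors=5),      # looks at the most similar flowers
    DecisionTreeClassifier(random_state=0),   # learns a series of yes/no questions
]

flower = [[6.0, 2.8, 4.9, 1.6]]   # hypothetical measurements, in cm

predictions = {}
for estimator in algorithms:
    estimator.fit(X, y)                        # same data for each algorithm
    name = type(estimator).__name__
    predictions[name] = int(estimator.predict(flower)[0])
    print(f"{name}: {predictions[name]}")
```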

But an algorithm is not a model! A model is an algorithm + data.  When we want to create 