# Iris Data Set and Sklearn

In [1]:
# load iris data set from datasets modules
from sklearn.datasets import load_iris

In [2]:
# save "bunch" object containing iris dataset and its attributes
iris = load_iris()
type(iris)

sklearn.utils.Bunch

In [3]:
# print iris data
print (iris.data[0:5])

[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]


## Machine learning terminology
* Each row is an **observation** (sample, instance, example, record)
* Each column is a **feature** (predictor, attribute, independent variable, input, regressor)

In [4]:
# Print the name of the features
print (iris.feature_names)

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


In [5]:
# Print integers representing the species of each observation
print (iris.target)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


In [6]:
# Print the encoding scheme for species 
print (iris.target_names)

['setosa' 'versicolor' 'virginica']


* Each value we are predicting is the **response** (target, outcome, label, dependent variable)
* **Classification** is supervised learning in which the response is categorical
* **Regression** is supervised learning in which the response is continuous
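The integer codes in `iris.target` and the names in `iris.target_names` can be combined to make the categorical response easier to read. The following is a minimal sketch; the variable name `species` is introduced here for illustration only.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Index the array of species names with the integer codes
# to get one species name per observation
species = iris.target_names[iris.target]

print(species[:3])    # names for the first three observations
print(species[-3:])   # names for the last three observations
```

Because the response takes one of three category values rather than a continuous measurement, predicting it is a classification task.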

## Requirements for working with data in scikit-learn
1. Features and responses are **separate objects**
2. Features and responses should be **numeric**
3. Features and responses should be **NumPy arrays**
4. Features and responses should have **specific shapes**
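The four requirements can be written as explicit checks. This is only a sketch: `check_sklearn_inputs` is a hypothetical helper, not part of scikit-learn, and it assumes the usual convention that the features form a 2-D array of shape (observations, features) and the response a 1-D array with one entry per observation.

```python
import numpy as np
from sklearn.datasets import load_iris

def check_sklearn_inputs(X, y):
    """Hypothetical helper: assert the four requirements listed above."""
    # 1. Features and response are separate objects
    assert X is not y
    # 2. Both are numeric
    assert np.issubdtype(X.dtype, np.number)
    assert np.issubdtype(y.dtype, np.number)
    # 3. Both are NumPy arrays
    assert isinstance(X, np.ndarray) and isinstance(y, np.ndarray)
    # 4. Shapes: X is 2-D, y is 1-D, and the observation counts match
    assert X.ndim == 2 and y.ndim == 1
    assert X.shape[0] == y.shape[0]
    return True

iris = load_iris()
print(check_sklearn_inputs(iris.data, iris.target))
```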

In [7]:
# Check the type of features and response
print (type(iris.data))
print (type(iris.target))

<class 'numpy.ndarray'>
<class 'numpy.ndarray'>


In [8]:
# check the shape of the features
print (iris.data.shape)

(150, 4)


In [9]:
# check the shape of the response
print (iris.target.shape)

(150,)


In [10]:
# store feature matrix in X
X = iris.data

# store response vector in y
y = iris.target

### Review of Iris dataset 
* 150 **observations** (iris flowers)
* 4 **features**: sepal length, sepal width, petal length, petal width
* **Response** is the iris species
* This is a **classification** problem since the response is categorical

### K-nearest neighbors (KNN) classification
1. Pick a value for k (the number of neighbors to take into account)
2. Search for the k observations in the data set that are 'nearest' to the measurements of the unknown iris
3. Use the most popular response value from the k nearest neighbors as the predicted response value for the unknown iris
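The three steps above can be sketched directly in NumPy. This is an illustration, not scikit-learn's implementation: the function name `knn_predict` is hypothetical, 'nearest' is assumed to mean smallest Euclidean distance, and a tied vote is resolved by whichever label was counted first.

```python
from collections import Counter

import numpy as np
from sklearn.datasets import load_iris

def knn_predict(X_train, y_train, x_new, k):
    """Hypothetical helper: predict one label by majority vote of k neighbors."""
    # Step 2: Euclidean distance from the unknown observation to every training row
    distances = np.linalg.norm(X_train - np.asarray(x_new, dtype=float), axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Step 3: the most common response value among those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

iris = load_iris()

# Step 1: pick a value for k, then predict for one unknown iris
prediction = knn_predict(iris.data, iris.target, [3, 5, 4, 2], k=1)
print(prediction, iris.target_names[prediction])
```

A production implementation would avoid computing every distance for every query; the brute-force version is used here only because it mirrors the steps one to one.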

### scikit-learn 4-step modeling pattern
**Step 1** Import the class to use

In [11]:
from sklearn.neighbors import KNeighborsClassifier

**Step 2** Instantiate the estimator
- _Estimator_ is scikit-learn's term for model
- _Instantiate_ means 'make an instance of'

In [12]:
knn = KNeighborsClassifier(n_neighbors=1)

* The name of the object does not matter
* Tuning parameters (hyperparameters) can be specified during this step
* All parameters not specified are set to their defaults

In [13]:
# Default values
print (knn)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=1, p=2,
                     weights='uniform')


**Step 3** Fit the model with data (model training)
* Model is learning the relationship between X and y
* Occurs in-place (it is not necessary to assign output to a variable)

In [14]:
knn.fit(X,y)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=1, p=2,
                     weights='uniform')
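The output printed by `fit` is the estimator itself, which is why there is no need to assign the result to a variable. A small sketch to confirm this; the name `returned` is used only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
knn = KNeighborsClassifier(n_neighbors=1)

# fit() trains the estimator in place and returns that same object
returned = knn.fit(iris.data, iris.target)

print(returned is knn)
```

Returning the estimator also allows chaining, as in `KNeighborsClassifier(n_neighbors=1).fit(X, y)`.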

**Step 4** Predict the response for a new observation
* New observations are called 'out-of-sample' data
* The model uses what it learned during the training process

In [15]:
knn.predict([[3,5,4,2]])

array([2])

2 means 'virginica'

* Returns a NumPy array
* Can predict for multiple observations at once

In [16]:
X_new = [[3,5,4,2],[5,4,3,2]]
knn.predict(X_new)

array([2, 1])

* First unknown is 'virginica'
* Second unknown is 'versicolor'
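Rather than decoding the integers by hand, the predictions can be used to index `iris.target_names`. A minimal sketch, with `predicted_names` as an illustrative variable name:

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
knn = KNeighborsClassifier(n_neighbors=1).fit(iris.data, iris.target)

X_new = [[3, 5, 4, 2], [5, 4, 3, 2]]

# predict() returns integer codes; index the names array to decode them
predicted_names = iris.target_names[knn.predict(X_new)]
print(predicted_names)
```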

### Using a different value of K

In [17]:
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X,y)
knn.predict(X_new)

array([1, 1])

Both unknowns are 'versicolor'
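To see how the choice of K affects the predictions, the same pattern can be repeated in a loop. This is a sketch; the values 1, 5 and 10 and the dictionary `predictions_by_k` are arbitrary choices made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target
X_new = [[3, 5, 4, 2], [5, 4, 3, 2]]

predictions_by_k = {}
for k in (1, 5, 10):
    # Instantiate, fit and predict with a different number of neighbors each time
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X, y)
    predictions_by_k[k] = knn.predict(X_new)
    print(k, iris.target_names[predictions_by_k[k]])
```

Predictions on two hand-picked observations do not show which K is best; that requires evaluating the model on data with known responses.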

### Using a different classification model

In [21]:
# import class
from sklearn.linear_model import LogisticRegression

# instantiate the model (using default parameters)
#logreg = LogisticRegression(solver='lbfgs', multi_class='auto')
logreg = LogisticRegression()

# fit the model with data
logreg.fit(X,y)

# Predict the response for new observations
logreg.predict(X_new)



array([2, 0])

* First unknown is 'virginica'
* Second unknown is 'setosa'
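Logistic regression can also report a probability for each class, which shows how confident the model is in each prediction. The sketch below assumes `max_iter=1000`, a value chosen here only to give the solver room to converge; it is not part of the original cell, and the resulting numbers may vary between scikit-learn versions.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X, y = iris.data, iris.target
X_new = [[3, 5, 4, 2], [5, 4, 3, 2]]

# max_iter=1000 is an assumption made for this sketch
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X, y)

# One row per new observation, one column per class in logreg.classes_
proba = logreg.predict_proba(X_new)
print(logreg.classes_)
print(proba.round(3))
```

Each row sums to 1, and `predict` returns the class whose column holds the largest probability.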