In [None]:
# import the load_iris function from the datasets module
from sklearn.datasets import load_iris

In [None]:
# save the 'Bunch' object containing the iris dataset and its attributes
iris = load_iris()
type(iris)

In [None]:
# print the iris data: sepal length, sepal width, petal length, petal width
# each row of the output represents one flower
# each row is also called a sample, observation, instance, or example; e.g. the iris dataset has
# 150 observations
# each column is a feature, also known as an attribute, predictor, independent variable, input,
# regressor, or covariate; e.g. the iris dataset has four features
print(iris.data)

In [None]:
# print the names of the four features of the iris dataset; they also serve as column headers for the data
print(iris.feature_names)

In [None]:
# print the integers (targets) representing the species of each observation
# here 0 = setosa, 1 = versicolor, 2 = virginica
print(iris.target)

In [None]:
# print the encoding scheme for species: 0 = setosa, 1 = versicolor, 2 = virginica
print(iris.target_names)
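
The encoding scheme can also be applied programmatically. The following is a minimal sketch, assuming only NumPy and the `load_iris` loader used above; the variable names `species` and `counts` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Index the array of names with the integer codes to decode every observation
species = iris.target_names[iris.target]
print(species[:3])    # names of the first three observations
print(species[-3:])   # names of the last three observations

# Count how many observations fall in each class
counts = np.bincount(iris.target)
print({str(name): int(n) for name, n in zip(iris.target_names, counts)})
```

Indexing `target_names` with the integer array works because each integer in `iris.target` is a valid position in the array of names.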

- Each value we are predicting is the **response** (also known as: target, outcome, label, dependent variable)
- **Classification** is supervised learning in which the response is categorical
- **Regression** is supervised learning in which the response is ordered and continuous

# Requirements for working with data in scikit-learn
1. Features and response are **separate objects**
2. Features and responses should be **numeric**
3. Features and responses should be **NumPy arrays**
4. Features and responses should have **specific shapes**


In [None]:
# check the type of the features and response
print(type(iris.data))
print(type(iris.target))

In [None]:
# check the shape of the features (first dimension = number of observations)
# the shape of the iris data is 150 x 4 because it has 150 rows and 4 columns
print(iris.data.shape)

In [None]:
# check the shape of the response (single dimension matching the number of observations)
# there should be one response for each observation
print(iris.target.shape)

- __iris.data__ and __iris.target__ meet scikit-learn's **four requirements** for feature and response objects
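
As a rough illustration, the four requirements can be checked in one place with assertions. This is only a sketch: the checks below are one possible reading of each requirement, not an official scikit-learn validation routine.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# 1. Features and response are separate objects
assert X is not y
# 2. Both are numeric
assert np.issubdtype(X.dtype, np.number)
assert np.issubdtype(y.dtype, np.number)
# 3. Both are NumPy arrays
assert isinstance(X, np.ndarray) and isinstance(y, np.ndarray)
# 4. X is 2-D (observations x features); y is 1-D with one entry per observation
assert X.ndim == 2 and y.ndim == 1
assert X.shape[0] == y.shape[0]

print("all four requirements hold:", X.shape, y.shape)
```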

In [None]:
# store feature matrix in 'X'
X = iris.data

# store response vector in 'y'
y = iris.target

- Here **X** is uppercase because it represents a *matrix*, and **y** is lowercase because it represents a *vector*

- Here we have 150 observations
- 4 **features** (sepal length, sepal width, petal length, petal width)
- The **response** variable is the iris species (the fifth column of the original data table; scikit-learn stores it separately in `iris.target`)
- This is a **classification** problem since the response is categorical

## K-nearest neighbors classification
1. Pick a value of K. How to choose value of K?
2. Search for the K observations in the training data that are "nearest" to the measurements of the unknown iris.
3. Use the most popular response value from the K nearest neighbors as the predicted response value for the unknown iris.
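
The three steps above can be sketched from scratch with NumPy. This is a minimal illustration, not scikit-learn's implementation: the helper `knn_predict` is hypothetical, Euclidean distance is assumed as the measure of "nearest", and K=5 is an arbitrary choice.

```python
from collections import Counter

import numpy as np
from sklearn.datasets import load_iris


def knn_predict(X_train, y_train, x_new, k):
    """Predict the class of one observation by majority vote of its k nearest neighbors."""
    # Step 2: Euclidean distance from x_new to every training observation
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    # Step 3: the most common response value among those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


iris = load_iris()
# Step 1: pick K (5 here is an arbitrary choice for illustration)
unknown = np.array([3, 5, 4, 2])
prediction = knn_predict(iris.data, iris.target, unknown, k=5)
print(iris.target_names[prediction])
```

In practice the scikit-learn estimator introduced below is preferable, since it also offers faster neighbor searches and other distance metrics.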


### Example training data

Visualize the training data on a coordinate plane, with the x and y coordinates representing the feature values and the color representing the response class:

![Training data](http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2015/04/04_knn_dataset-300x198.png)

### KNN classification map (K=1)

KNN can predict the response class for a future observation by calculating the "distance" to all training observations and assuming that the response class of nearby observations is likely to be similar. These predictions can be visualized using a classification map:

![1NN classification map](http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2015/04/04_1nn_map-300x198.png)

## Loading the data

In [None]:
# import load_iris function from datasets module
from sklearn.datasets import load_iris

# save the 'Bunch' object containing the iris dataset and its attributes
iris = load_iris()

# store feature matrix in 'X'
X = iris.data

# store response vector in 'y'
y = iris.target

In [None]:
# print the shape of X and y

print(X.shape)
print(y.shape)

## scikit-learn 4-step modeling pattern

**Step 1:** Import the class you plan to use

In [None]:
from sklearn.neighbors import KNeighborsClassifier

**Step 2:** "Instantiate" the "estimator"

- The role of an "estimator" is to estimate unknown quantities
- "Estimator" is scikit-learn's term for a model
- "Instantiate" means "make an instance of"; here we create an instance of the *KNeighborsClassifier* class

In [None]:
knn = KNeighborsClassifier(n_neighbors=1)

# here knn is an object (instance) of the KNeighborsClassifier class, which knows how to do
# K-nearest neighbors classification. It is just waiting for us to give it some data.


- The name of the object does not matter
- *n_neighbors* represents the value of K. It tells the knn object to look for the 1 nearest neighbor when it runs
- *n_neighbors* is called a tuning parameter
- Tuning parameters (aka "hyperparameters") can be specified during this step
- All parameters of *KNeighborsClassifier* that are not specified are set to their default values
- See the default values in the next cell

In [None]:
print(knn)
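
Depending on the scikit-learn version, `print(knn)` may list only the parameters that differ from their defaults. A small sketch that lists every parameter explicitly, using the estimator's `get_params` method, might look like this:

```python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=1)

# get_params() returns a dict of every parameter, including those left at their defaults
params = knn.get_params()
for name, value in sorted(params.items()):
    print(f"{name} = {value!r}")
```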

**Step 3:** Fit the model with data (aka "model training")

- Model is learning the relationship between X and y
- Occurs in-place

In [None]:
# call the fit method on the knn object
# this operation occurs in-place; we don't have to assign its result to another object
knn.fit(X, y)
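
To make "in-place" concrete, the sketch below checks that `fit` returns the very same object it was called on, and that the fitted estimator now carries learned state. The variable name `returned` is only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

knn = KNeighborsClassifier(n_neighbors=1)
returned = knn.fit(X, y)

# fit() modifies knn itself and returns that same object
print(returned is knn)
# After fitting, the estimator stores the class labels it saw in y
print(knn.classes_)
```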

**Step 4:** Predict the response for a new observation

- The *predict* method takes the measurements of an unknown iris as a nested Python list (one inner list per observation) and asks the fitted model to predict the iris species based on what it learned in the previous step.
- *predict* expects a 2-D NumPy array of shape (number of observations, number of features) and returns a NumPy array with the predicted response values. It also accepts a list of lists, which is converted to a NumPy array automatically. Current scikit-learn versions reject a flat 1-D list, so a single observation is wrapped in an outer list.
- New observations are called "out-of-sample" data
- The model uses the information it learned during the model training process

Here the predict method returns array([2]), which means that virginica is the predicted species for the given unknown iris.

In [None]:
knn.predict([[3, 5, 4, 2]])

- Returns a NumPy array
- Can predict for multiple observations at once

In [None]:
X_new = [[3, 5, 4, 2], [5, 4, 3, 2]]
# pass a list of lists containing the measurements of two unknown irises;
# it is converted to a 2 x 4 NumPy array: two observations of four features each
knn.predict(X_new)
# the predicted output is 2 for the first unknown iris and 1 for the second
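
The integer predictions can be turned back into species names by indexing `iris.target_names`. The following is a minimal sketch; `predicted_names` is an illustrative variable name, not something from the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(iris.data, iris.target)

X_new = [[3, 5, 4, 2], [5, 4, 3, 2]]
predicted = knn.predict(X_new)

# Use the integer predictions as indices into the array of species names
predicted_names = iris.target_names[predicted]
for measurements, name in zip(X_new, predicted_names):
    print(measurements, "->", name)
```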

## Using a different value for K

- Changing the value of K is known as **model tuning**.

In [None]:
# instantiate the model (using the value K=5)
knn = KNeighborsClassifier(n_neighbors=5)

# fit the model with data
knn.fit(X, y)

# predict the response for new observations
knn.predict(X_new)
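
To see where a K=5 prediction comes from, the estimator's `kneighbors` method can be used to look up the five nearest training observations and their labels. This sketch assumes the default uniform weighting, under which the prediction is a simple majority vote; `neighbor_labels` and `votes` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target
X_new = [[3, 5, 4, 2], [5, 4, 3, 2]]

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)

# kneighbors() returns the distances to, and row indices of, the K nearest training observations
distances, indices = knn.kneighbors(X_new)
neighbor_labels = y[indices]  # shape (2, 5): labels of each new observation's neighbors
print(neighbor_labels)

# Majority vote per row, compared with what predict() returns
votes = np.array([np.bincount(row, minlength=3).argmax() for row in neighbor_labels])
print(votes, knn.predict(X_new))
```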

## Using a different classification model

In [None]:
from sklearn.linear_model import LogisticRegression

# instantiate the model (using the default parameters)
logreg = LogisticRegression()

# fit the model with data
logreg.fit(X, y)

# predict the response for new observations
logreg.predict(X_new)

- You might be wondering which model produced the correct predictions for these two unknown irises. The answer is that we don't know: these are out-of-sample observations, so we don't know the true response values.
- Our goal with *supervised learning* is to build a model that generalizes to new data.
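
As a compact recap, the sketch below fits the three models used in this notebook and prints their predictions for the same two unknown irises side by side. It only shows whether the models agree, not which one is right. Note that `max_iter=1000` is an assumption added here to avoid a possible convergence warning; the cell above uses the defaults.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target
X_new = [[3, 5, 4, 2], [5, 4, 3, 2]]

models = {
    "KNN (K=1)": KNeighborsClassifier(n_neighbors=1),
    "KNN (K=5)": KNeighborsClassifier(n_neighbors=5),
    # max_iter is raised only to avoid a convergence warning (an assumption of this sketch)
    "Logistic regression": LogisticRegression(max_iter=1000),
}

predictions = {}
for label, model in models.items():
    model.fit(X, y)
    predictions[label] = iris.target_names[model.predict(X_new)].tolist()
    print(label, predictions[label])
```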