**Modeling and Prediction**
- Training,
- Performance Metrics,
- Evaluation,
- Deployment.

**Training**
- Training is the process by which the system learns from data.
- **Modeling, evaluation, and prediction** need four datasets:
- ***training features,***
- ***testing features,***
- ***training target variables,***
- ***and testing target variables.***
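
As a minimal sketch of how these four datasets are usually produced, the cell below splits the iris data with `train_test_split`. The values `test_size=0.25` and `random_state=0` are illustrative assumptions, not settings prescribed by this section.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# feature matrix X and target vector y
X, y = load_iris(return_X_y=True)

# test_size and random_state are illustrative choices (assumptions)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# training/testing features, then training/testing targets
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```

With 150 observations and `test_size=0.25`, scikit-learn rounds the test set up to 38 rows and leaves 112 for training.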

# Classification

- **1) K-nearest neighbors (KNN) classification**
- **2) scikit-learn and the steps for model training and prediction**
- **3) Applying the same steps to other machine learning models**

In [28]:
# Load libraries
import numpy
from matplotlib import pyplot
from pandas import read_csv
from pandas import set_option
#from pandas.tools.plotting import scatter_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier

## Reviewing the iris dataset

- 150 **observations**
- 4 **features** (sepal length, sepal width, petal length, petal width)
- **Response** variable is the iris species
- **Classification** problem since response is categorical
- More information in the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/Iris)

## K-nearest neighbors (KNN) classification

1. Pick a value for K.
2. Search for the K observations in the training data that are "nearest" to the measurements of the unknown iris.
3. Use the most popular response value from the K nearest neighbors as the predicted response value for the unknown iris.
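
The three steps above can be sketched in a few lines of NumPy. This is only an illustration of the idea, assuming Euclidean distance and a simple majority vote; it is not how scikit-learn implements `KNeighborsClassifier`. The function name `knn_predict` and the toy data are hypothetical.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Predict one label by majority vote among the k nearest neighbors."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    # Euclidean distance from x_new to every training observation
    distances = np.linalg.norm(X_train - np.asarray(x_new, dtype=float), axis=1)
    # indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # most common response value among those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# toy data, invented for illustration only
X_toy = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
y_toy = [0, 0, 1, 1, 1]

pred_a = knn_predict(X_toy, y_toy, [1.1, 0.9], k=3)
pred_b = knn_predict(X_toy, y_toy, [5.0, 5.0], k=3)
print(pred_a, pred_b)
```

For the first query, two of the three nearest neighbors belong to class 0, so the vote returns 0; for the second, all three neighbors belong to class 1.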

## Loading the data

In [4]:
# import load_iris function from datasets module
from sklearn.datasets import load_iris

# save "bunch" object containing iris dataset and its attributes
iris = load_iris()

# store feature matrix in "X"
X = iris.data

# store response vector in "y"
y = iris.target

In [5]:
# print the shapes of X and y
print(X.shape)
print(y.shape)

(150, 4)
(150,)


## scikit-learn 4-step modeling pattern

**Step 1:** Import the class you plan to use

In [6]:
from sklearn.neighbors import KNeighborsClassifier

**Step 2:** "Instantiate" the "estimator"

- "Estimator" is scikit-learn's term for model
- "Instantiate" means "make an instance of"

In [7]:
knn = KNeighborsClassifier(n_neighbors=1)

- Name of the object does not matter
- Can specify tuning parameters (aka "hyperparameters") during this step
- All parameters not specified are set to their defaults

In [8]:
print(knn)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=1, p=2,
                     weights='uniform')
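
Recent scikit-learn releases print only the parameters that differ from their defaults, so `print(knn)` may show less than the output above. As a small sketch, `get_params()` returns the full parameter dictionary regardless of version. The name `knn_demo` is a hypothetical variable, used here so the `knn` object above is left untouched.

```python
from sklearn.neighbors import KNeighborsClassifier

# hypothetical name, chosen to avoid overwriting the notebook's knn object
knn_demo = KNeighborsClassifier(n_neighbors=1)

# dictionary of every parameter, including those left at their defaults
params = knn_demo.get_params()
print(params['n_neighbors'])
print(params['weights'])
```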


**Step 3:** Fit the model with data (aka "model training")

- Model is learning the relationship between X and y
- Occurs in-place

In [9]:
knn.fit(X, y)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=1, p=2,
                     weights='uniform')

**Step 4:** Predict the response for a new observation

- New observations are called "out-of-sample" data
- Uses the information it learned during the model training process

In [34]:
knn.score(X, y)

0.9666666666666667

In [10]:
knn.predict([[3, 5, 4, 2]])

array([2])

- Returns a NumPy array
- Can predict for multiple observations at once

In [11]:
X_new = [[3, 5, 4, 2], [5, 4, 3, 2]]
knn.predict(X_new)

array([2, 1])
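
The predictions are integer codes for the species. A minimal sketch of translating them back to species names with `iris.target_names` follows; the variable names `knn_k1`, `pred`, and `names` are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()

# same setup as above: K=1, trained on all of the data
knn_k1 = KNeighborsClassifier(n_neighbors=1).fit(iris.data, iris.target)

pred = knn_k1.predict([[3, 5, 4, 2], [5, 4, 3, 2]])

# index the array of species names with the predicted codes
names = iris.target_names[pred]
print(names)
```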

In [40]:
from sklearn.metrics import accuracy_score

y_pred = knn.predict(X)
print('Accuracy: %.2f' % accuracy_score(y, y_pred))

Accuracy: 0.97


## Using a different value for K

In [35]:
# instantiate the model (using the value K=5)
knn = KNeighborsClassifier(n_neighbors=5)

# fit the model with data
knn.fit(X, y)

# predict the response for new observations
knn.predict(X_new)

array([0, 1, 1, 0])

In [36]:
# %load solutions/05A_knn_with_diff_k.py
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split


iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.25,
                                                    random_state=1234,
                                                    stratify=y)

X_trainsub, X_valid, y_trainsub, y_valid = train_test_split(X_train, y_train,
                                                            test_size=0.5,
                                                            random_state=1234,
                                                            stratify=y_train)

for k in range(1, 20):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_trainsub, y_trainsub)
    train_score = knn.score(X_trainsub, y_trainsub)
    valid_score = knn.score(X_valid, y_valid)
    print('k: %d, Train/Valid Acc: %.3f/%.3f' %
          (k, train_score, valid_score))


knn = KNeighborsClassifier(n_neighbors=9)
knn.fit(X_train, y_train)
print('k=9 Test Acc: %.3f' % knn.score(X_test, y_test))


k: 1, Train/Valid Acc: 1.000/0.946
k: 2, Train/Valid Acc: 1.000/0.964
k: 3, Train/Valid Acc: 1.000/0.946
k: 4, Train/Valid Acc: 1.000/0.964
k: 5, Train/Valid Acc: 1.000/0.929
k: 6, Train/Valid Acc: 1.000/0.929
k: 7, Train/Valid Acc: 1.000/0.929
k: 8, Train/Valid Acc: 1.000/0.929
k: 9, Train/Valid Acc: 1.000/0.929
k: 10, Train/Valid Acc: 1.000/0.946
k: 11, Train/Valid Acc: 1.000/0.946
k: 12, Train/Valid Acc: 1.000/0.964
k: 13, Train/Valid Acc: 1.000/0.929
k: 14, Train/Valid Acc: 1.000/0.946
k: 15, Train/Valid Acc: 1.000/0.929
k: 16, Train/Valid Acc: 1.000/0.929
k: 17, Train/Valid Acc: 1.000/0.929
k: 18, Train/Valid Acc: 0.964/0.946
k: 19, Train/Valid Acc: 0.964/0.929
k=9 Test Acc: 0.974


In [21]:
X_new = [[5, 3, 2, 1], [5, 4, 3, 2], [4, 1, 3, 2], [4, 5, 3, 2]]
knn.predict(X_new)

array([0, 1, 1, 0])

In [38]:
from sklearn.metrics import accuracy_score

y_pred = knn.predict(X_test)
print('Accuracy: %.2f' % accuracy_score(y_test, y_pred))

Accuracy: 0.97


## Using a different classification model
### LogisticRegression

In [22]:
# import the class
from sklearn.linear_model import LogisticRegression

# instantiate the model (using the default parameters)
logreg = LogisticRegression()

# fit the model with data
logreg.fit(X, y)

# predict the response for new observations
logreg.predict(X_new)



array([0, 0, 2, 0])

In [23]:
# fit the model with data
logreg.fit(X_train, y_train)

# predict the response for new observations
logreg.predict(X_new)



array([0, 0, 2, 0])

### Linear model classifier: SGDClassifier

In [24]:
# create the linear model classifier
from sklearn.linear_model import SGDClassifier
clf = SGDClassifier()
# fit the model with data
clf.fit(X, y)

# predict the response for new observations
clf.predict(X_new)

array([0, 0, 2, 0])

In [25]:
# fit the model with data
clf.fit(X_train, y_train)

# predict the response for new observations
clf.predict(X_new)

array([0, 0, 2, 0])

In [26]:
from sklearn.svm import SVC
SVM = SVC()
# fit the model with data
SVM.fit(X, y)

# predict the response for new observations
SVM.predict(X_new)



array([0, 1, 2, 0])

In [29]:
LDA = LinearDiscriminantAnalysis()
# fit the model with data
LDA.fit(X, y)

# predict the response for new observations
LDA.predict(X_new)

array([0, 1, 2, 1])

In [30]:
CART = DecisionTreeClassifier()
# fit the model with data
CART.fit(X, y)

# predict the response for new observations
CART.predict(X_new)

array([1, 1, 2, 1])

In [31]:
NB = GaussianNB()
# fit the model with data
NB.fit(X, y)

# predict the response for new observations
NB.predict(X_new)

array([1, 1, 1, 2])

In [32]:
GBCE = GradientBoostingClassifier()
# fit the model with data
GBCE.fit(X, y)

# predict the response for new observations
GBCE.predict(X_new)

array([1, 1, 2, 1])

In [33]:
RF = RandomForestClassifier()
# fit the model with data
RF.fit(X, y)

# predict the response for new observations
RF.predict(X_new)



array([0, 1, 2, 1])
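
Because every estimator exposes the same `fit`, `predict`, and `score` methods, the cells above can be condensed into one loop. The sketch below fits several of the models on a training split and reports accuracy on the held-out split. The short model labels and `max_iter=1000` (set only to avoid convergence warnings) are assumptions; the split settings mirror the earlier `train_test_split` cell.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1234, stratify=y)

# labels are arbitrary; max_iter=1000 is an assumption to avoid warnings
models = {
    'LR': LogisticRegression(max_iter=1000),
    'KNN': KNeighborsClassifier(n_neighbors=5),
    'CART': DecisionTreeClassifier(random_state=0),
    'NB': GaussianNB(),
    'SVM': SVC(),
    'LDA': LinearDiscriminantAnalysis(),
}

test_scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                        # train on the training split
    test_scores[name] = model.score(X_test, y_test)    # accuracy on held-out data
    print('%s: %.3f' % (name, test_scores[name]))
```

Exact scores depend on the split and on the scikit-learn version, so no expected output is shown.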

# Compare and evaluate models

* Train and test on the entire dataset (or on a sample of it).
* Compute classification accuracy for the models: logreg and KNN with K = 1, 5, 10, 20, ...
* This is known as **training accuracy**, because the model is trained and tested on the same data.
* Train/test split:
    1) Train the model on the training set.
    2) Test the model on the testing set, and evaluate how well it did.
* Re-compute classification accuracy for the models: logreg and KNN with K = 1, 5, 10, 20, ...
* Search for an even better value of K, and visualize the result with a plot (accuracy vs. K).
* Make predictions on out-of-sample data with models such as KNeighborsClassifier(n_neighbors=k), LogisticRegression(), SVC(), DecisionTreeClassifier(), etc.
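
A minimal sketch of the comparison outlined above: it records training accuracy and testing accuracy for KNN over a range of K values. The range 1 to 25 and the names `k_values`, `train_acc`, `test_acc`, and `best_k` are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1234, stratify=y)

k_values = list(range(1, 26))   # illustrative range of K (assumption)
train_acc, test_acc = [], []

for k in k_values:
    knn_k = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # training accuracy: scored on the same data the model was fit on
    train_acc.append(knn_k.score(X_train, y_train))
    # testing accuracy: scored on held-out data
    test_acc.append(knn_k.score(X_test, y_test))

# smallest K that reaches the highest testing accuracy on this split
best_k = k_values[test_acc.index(max(test_acc))]
print('best K on this split:', best_k)
```

The two lists can be passed to `pyplot.plot(k_values, train_acc)` and `pyplot.plot(k_values, test_acc)` to draw the accuracy vs. K plot. Picking K from the testing accuracy makes that accuracy an optimistic estimate, which is why the earlier cell selects K on a separate validation split.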




Train the final model with X and y (not X_train and y_train), so that it learns from all available data:
* model.fit(X, y)

Make a prediction for out-of-sample observations:
* X_new = [[3, 5, 4, 2], [5, 4, 3, 2], ...]
* model.predict(X_new)

# More model evaluation

http://localhost:8888/notebooks/Downloads/data-sc-BOOK/machine_learning_mastery_with_python_code/chapter_19/iris-classification.ipynb

http://localhost:8888/notebooks/1Evoa/scikit-learn-master/09_classification_metrics-Copy1.ipynb
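
Beyond accuracy, the `confusion_matrix` and `classification_report` functions imported at the top of this notebook give a per-class view of the errors. The sketch below is one possible way to use them, assuming the same train/test split as before and K=9; the names `knn_eval` and `cm` are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=1234,
    stratify=iris.target)

knn_eval = KNeighborsClassifier(n_neighbors=9).fit(X_train, y_train)
y_pred = knn_eval.predict(X_test)

# rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)

# precision, recall, and F1-score for each species
print(classification_report(y_test, y_pred,
                            target_names=iris.target_names))
```

The diagonal of the confusion matrix counts correct predictions, so its sum divided by the total equals the accuracy reported by `accuracy_score`.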