# Spot-Checking Classification Algorithms
* Spot-checking is a way of discovering which algorithms perform well on our machine learning problem.
* We must trial a number of methods and focus attention on those that prove the most promising.
* We use trial and error to discover a shortlist of algorithms that do well on our problem, which we can then double down on and tune further. This process is called **spot-checking**.
* The question we are going to answer is:
    * **Which algorithms should I spot-check on my dataset?**

## Algorithms Overview
We are going to take a look at six classification algorithms that we can spot-check on our dataset.

**1. Two linear ML algorithms**
  * Logistic Regression
  * Linear Discriminant Analysis
  
**2. Four nonlinear ML algorithms**
  * K-Nearest Neighbors
  * Naive Bayes
  * Classification and Regression Trees
  * Support Vector Machines
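
The six algorithms listed above can be spot-checked in a single loop. The following is a minimal sketch rather than the notebook's own code: it assumes a synthetic dataset from `make_classification` in place of the real data, and the names `X_demo`, `y_demo`, `candidates`, and `scores` are illustrative. `max_iter=1000` and the fixed `random_state` values are choices made here for a quiet, repeatable run.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

# Assumption: a small synthetic binary problem with 8 numeric features
# stands in for the real dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=7)

# One default-configured estimator per algorithm in the overview.
candidates = {
    "LR": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
    "KNN": KNeighborsClassifier(),
    "NB": GaussianNB(),
    "CART": DecisionTreeClassifier(random_state=7),
    "SVM": SVC(),
}

# Evaluate every candidate on the same 10 folds so the scores are comparable.
cv = KFold(n_splits=10, shuffle=True, random_state=7)
scores = {}
for name, estimator in candidates.items():
    fold_scores = cross_val_score(estimator, X_demo, y_demo, cv=cv, scoring="accuracy")
    scores[name] = fold_scores.mean()

# Print the candidates from highest to lowest mean accuracy.
for name, mean_acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {mean_acc:.3f}")
```

Sorting by mean accuracy gives a rough shortlist. The ranking on synthetic data says nothing about how the models rank on a real dataset.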


### 1.1 Logistic Regression
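* Logistic Regression models the probability of a binary outcome as a logistic function of a linear combination of the numeric input variables. It makes no strict Gaussian assumption about those inputs, though scaling them often helps the solver converge.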
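
Logistic regression turns a weighted sum of the inputs into a class probability with the logistic (sigmoid) function. Here is a small from-scratch sketch of that prediction step. The coefficients `demo_weights` and `demo_bias` are made-up values for illustration, not fitted ones.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, x):
    """Probability of the positive class for one input row."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z)

# Hypothetical coefficients for a two-feature problem.
demo_weights = [0.8, -0.5]
demo_bias = 0.1

p = predict_proba(demo_weights, demo_bias, [1.0, 2.0])
label = 1 if p >= 0.5 else 0  # threshold the probability at 0.5
print(round(p, 3), label)
```

Training consists of choosing the weights and bias that make these probabilities fit the labelled data. scikit-learn's `LogisticRegression` does this with an iterative solver.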

In [4]:
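# Logistic Regression Classification

from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# Load the data; column names are supplied, so the file is read as headerless
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:, 0:8]  # first eight columns are the input features
Y = array[:, 8]    # last column is the class label

# 10-fold cross-validation without shuffling; X, Y and kfold are reused by later cells
num_folds = 10
kfold = KFold(n_splits=num_folds, random_state=None)
model = LogisticRegression()
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())  # mean accuracy across the folds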

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

0.773427887901572
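
The convergence warning in the output above suggests either raising `max_iter` or scaling the data. One way to do both is sketched below, on the assumption that a pipeline is acceptable here. The data is synthetic (`make_classification`), so the score it prints is not comparable to the result above, and `X_syn`, `y_syn`, and `scaled_lr` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic data with 8 numeric features stands in for the real dataset.
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=7)

# Putting the scaler inside the pipeline means it is refit on each training fold,
# so no information from the held-out fold leaks into the scaling.
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

cv = KFold(n_splits=10, shuffle=True, random_state=7)
scaled_scores = cross_val_score(scaled_lr, X_syn, y_syn, cv=cv)
print(scaled_scores.mean())
```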



### 1.2 Linear Discriminant Analysis
* LDA is a statistical technique for binary and multiclass classification.
* It assumes a Gaussian distribution for the numerical input variables.
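
To show how the Gaussian assumption turns into a classifier, here is a from-scratch sketch of LDA for a single input variable. It assumes each class is Gaussian with its own mean and a variance shared by all classes, and predicts the class with the largest discriminant score. The data and the names `lda_x`, `lda_y`, `fit_lda_1d`, and `predict_lda_1d` are made up for illustration.

```python
import math

def fit_lda_1d(xs, ys):
    """Estimate class priors, class means, and one pooled variance."""
    classes = sorted(set(ys))
    n = len(xs)
    params = {}
    pooled_ss = 0.0
    for c in classes:
        values = [x for x, y in zip(xs, ys) if y == c]
        mean = sum(values) / len(values)
        pooled_ss += sum((v - mean) ** 2 for v in values)
        params[c] = (len(values) / n, mean)  # (prior, mean)
    variance = pooled_ss / (n - len(classes))
    return params, variance

def predict_lda_1d(params, variance, x):
    """Return the class with the highest linear discriminant score."""
    def score(c):
        prior, mean = params[c]
        return x * mean / variance - mean ** 2 / (2 * variance) + math.log(prior)
    return max(params, key=score)

# Toy data: class 0 clusters near 1.5, class 1 near 6.5.
lda_x = [1.0, 1.5, 2.0, 6.0, 6.5, 7.0]
lda_y = [0, 0, 0, 1, 1, 1]

lda_params, lda_variance = fit_lda_1d(lda_x, lda_y)
print(predict_lda_1d(lda_params, lda_variance, 2.5))
print(predict_lda_1d(lda_params, lda_variance, 5.5))
```

Because the variance is shared, the score is linear in `x`, which is why the resulting decision boundary is linear.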



In [5]:
# LDA Classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
model = LinearDiscriminantAnalysis()
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

0.773462064251538


### 2.1 K-Nearest Neighbors
* KNN uses a distance metric to find the k most similar instances in the training data for a new instance. For classification it predicts the most common class among those neighbors; the mean outcome is used for regression.
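
The neighbor search can be sketched from scratch in a few lines. This version assumes Euclidean distance and a majority vote over the k nearest labels, which is the usual choice for classification. The toy points and the name `knn_predict` are illustrative only.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    # Pair every training row's distance to the query with its label, nearest first.
    distances = sorted(
        (math.dist(row, query), label) for row, label in zip(train_X, train_y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy data: two well-separated clusters.
knn_X = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
knn_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(knn_X, knn_y, (2, 2)))
print(knn_predict(knn_X, knn_y, (6, 5)))
```

Since the prediction depends directly on distances, features on very different scales can dominate the result, so scaling is often worth trying with KNN.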

In [6]:
# KNN Classification
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier()
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

0.7265550239234451


### 2.2 Naive Bayes
* Naive Bayes calculates the **probability of each class** and the **conditional probability of each input value given each class**.
* For new data, these probabilities are multiplied together, assuming that the inputs are all **independent (a simple or naive assumption)**.
* When working with real-valued data, a Gaussian distribution is assumed so that the **probabilities for input variables can be estimated easily using the Gaussian probability density function**.
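
The calculation described above can be sketched from scratch: estimate a prior per class and a mean and standard deviation per feature per class, then multiply the prior by the Gaussian density of each input value. The toy data and function names are illustrative. The sketch assumes every feature has non-zero variance within each class.

```python
import math

def gaussian_pdf(x, mean, std):
    """Gaussian probability density of x for the given mean and standard deviation."""
    exponent = -((x - mean) ** 2) / (2 * std ** 2)
    return math.exp(exponent) / (math.sqrt(2 * math.pi) * std)

def fit_gaussian_nb(rows, labels):
    """Return {class: (prior, [(mean, std) for each feature])}."""
    model = {}
    for c in set(labels):
        members = [r for r, y in zip(rows, labels) if y == c]
        stats = []
        for column in zip(*members):
            mean = sum(column) / len(column)
            var = sum((v - mean) ** 2 for v in column) / len(column)
            stats.append((mean, math.sqrt(var)))
        model[c] = (len(members) / len(rows), stats)
    return model

def predict_gaussian_nb(model, row):
    """Pick the class whose prior times per-feature densities is largest."""
    def score(c):
        prior, stats = model[c]
        likelihoods = [gaussian_pdf(x, m, s) for x, (m, s) in zip(row, stats)]
        return prior * math.prod(likelihoods)
    return max(model, key=score)

# Toy data: two classes, two real-valued features.
nb_rows = [(1.0, 2.1), (1.2, 1.9), (0.8, 2.0), (3.0, 4.1), (3.2, 3.9), (2.8, 4.0)]
nb_labels = [0, 0, 0, 1, 1, 1]

nb_model = fit_gaussian_nb(nb_rows, nb_labels)
print(predict_gaussian_nb(nb_model, (1.1, 2.0)))
print(predict_gaussian_nb(nb_model, (3.1, 4.0)))
```

With many features the product of small densities can underflow to zero, so practical implementations usually add log-probabilities instead of multiplying.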

In [7]:
# Gaussian Naive Bayes Classification
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

0.7551777170198223


### 2.3 Classification and Regression Trees
* CART constructs a binary tree from the training data.
* Split points are chosen greedily by evaluating each attribute and each value of each attribute in the training data in order to minimize a cost function (**such as the Gini index**).
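
The greedy search for a single split can be sketched as follows. It tries every observed value of every attribute as a threshold and keeps the split with the lowest weighted Gini index. This is a simplified illustration of one split, not scikit-learn's implementation, and the names `gini`, `best_split`, `cart_rows`, and `cart_labels` are made up for the example.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the group is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(rows, labels):
    """Return (weighted_gini, feature_index, threshold) of the best single split."""
    best = None
    n = len(rows)
    for feature in range(len(rows[0])):
        for row in rows:
            threshold = row[feature]
            # Rows below the threshold go left, the rest go right.
            left = [y for r, y in zip(rows, labels) if r[feature] < threshold]
            right = [y for r, y in zip(rows, labels) if r[feature] >= threshold]
            cost = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or cost < best[0]:
                best = (cost, feature, threshold)
    return best

# Toy data: the first feature separates the classes, the second does not.
cart_rows = [(2.0, 7.0), (3.0, 1.0), (4.0, 6.0), (7.0, 2.0), (8.0, 8.0), (9.0, 3.0)]
cart_labels = [0, 0, 0, 1, 1, 1]

print(best_split(cart_rows, cart_labels))
```

A full tree repeats this search recursively on the left and right groups until a stopping rule is met.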

In [9]:
# CART Classification
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

0.6978298017771701


### 2.4 Support Vector Machines
* SVM seeks a line that best separates two classes.
* The data instances closest to that separating line are called **support vectors** and influence where the line is placed.
* Of particular importance is the use of **different kernel functions** via the **kernel parameter**.
* A **powerful Radial Basis Function** kernel is used by default.
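
A kernel function measures how similar two instances are. As a rough illustration, the Radial Basis Function kernel can be written as `exp(-gamma * squared_distance)`. The `gamma` value of 0.5 below is an arbitrary choice for the example and is not the default that `SVC` uses.

```python
import math

def rbf_kernel(a, b, gamma=0.5):
    """RBF similarity: 1.0 for identical points, decaying toward 0 with distance."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * squared_distance)

print(rbf_kernel((1.0, 2.0), (1.0, 2.0)))  # identical points
print(rbf_kernel((0.0, 0.0), (1.0, 1.0)))  # squared distance 2, so exp(-1)
print(rbf_kernel((0.0, 0.0), (5.0, 5.0)))  # distant points, close to 0
```

In scikit-learn, other kernels can be tried by passing the `kernel` argument, for example `SVC(kernel='linear')`.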

In [10]:
# SVM Classification
from sklearn.svm import SVC

model = SVC()
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

0.7604237867395763


# Summary
* We learned how to spot-check six machine learning classification algorithms.