# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) PCA, SVM, Pipeline
_Steven Longstreet (DC)_

This notebook wraps up with a few more advanced machine learning techniques (PCA and support vector machines), then uses a pipeline to automate fitting several models and find the one with the best predictive performance.

In [None]:
#libraries for machine learning
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
#from sklearn.externals import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

from matplotlib import pyplot as plt
import webbrowser

In [None]:
#import the iris data from sklearn
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns = iris.feature_names)
df["classification"] = pd.Series(iris.target)
df.head()

In [None]:
#train-test split
X = df[["sepal length (cm)", "sepal width (cm)", "petal length (cm)"]]
y = df.classification
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .3, 
                                                   random_state = 42)

# PCA
PCA as dimensionality reduction

Using PCA for dimensionality reduction involves zeroing out one or more of the smallest principal components, resulting in a lower-dimensional projection of the data that preserves the maximal data variance.

This reduced-dimension dataset is in some sense "good enough" to encode the most important relationships between the points: despite dropping a dimension (here, going from three features to two components), the overall relationships between the data points are mostly preserved.
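
As a rough illustration of "zeroing out" the smallest component, the sketch below projects the same three iris features onto two components and then maps them back to three dimensions. The `*_demo` names are placeholders that are not part of this notebook, and for brevity the scaler is fit on the full dataset rather than a training split.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Same three iris features used in this notebook, standardized
X_demo = StandardScaler().fit_transform(load_iris().data[:, :3])

# Keep the two largest principal components, dropping the smallest
pca_demo = PCA(n_components=2).fit(X_demo)
X_reduced = pca_demo.transform(X_demo)

# Map back to the original 3-D space; the dropped direction is zeroed out
X_restored = pca_demo.inverse_transform(X_reduced)

# The relative reconstruction error is the share of variance that was lost
lost = np.sum((X_demo - X_restored) ** 2) / np.sum(X_demo ** 2)
kept = pca_demo.explained_variance_ratio_.sum()
print("reduced shape:", X_reduced.shape)
print("variance kept:", kept, "variance lost:", lost)
```

The kept and lost shares add up to one, which is one way to read "preserves the maximal data variance": the discarded direction is the one carrying the least variance.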

In [None]:
#SCALE
#scaler
scaler = StandardScaler()

#fit on training ONLY
scaler.fit(X_train)

#transform both training and test, based on training fit
X_train_transformed = scaler.transform(X_train)
X_test_transformed = scaler.transform(X_test)

#PCA
#import
#from sklearn.decomposition import PCA

#instantiate: n_components keeps the top n principal components
pca = PCA(n_components = 2)

#fit on the scaled training data ONLY
pca.fit(X_train_transformed)

#transform both training and test, based on training fit
X_train_pca = pca.transform(X_train_transformed)
X_test_pca = pca.transform(X_test_transformed)

#what happened??
print("original shape: ", X_train.shape)
print("transformed shape: ", X_train_pca.shape)

In [None]:
# Were two principal components the best choice? One way to check:
# a float n_components keeps just enough components to explain that share of the variance
pca = PCA(.95)
pca.fit(X_train_transformed)
pca.n_components_
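
The check above can also be reproduced by hand. This is a minimal sketch, assuming the same three iris features and, for brevity, scaling on the full dataset; `cumulative`, `k_manual` and `k_auto` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_demo = StandardScaler().fit_transform(load_iris().data[:, :3])

# Cumulative share of variance explained as components are added
cumulative = np.cumsum(PCA().fit(X_demo).explained_variance_ratio_)

# Smallest number of components whose cumulative share exceeds 95%
k_manual = int(np.searchsorted(cumulative, 0.95, side="right") + 1)

# PCA with a float n_components makes this choice for us
k_auto = PCA(n_components=0.95).fit(X_demo).n_components_

print("cumulative explained variance:", cumulative)
print("components needed (manual):", k_manual)
print("components needed (PCA(.95)):", k_auto)
```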

### Support Vector Machine Classifier


- [Support Vector Machine](https://en.wikipedia.org/wiki/Support_vector_machine)

**Background** The goal of a support vector machine is to find the optimal separating hyperplane, the one that maximizes the margin of the training data.

**Pros**:
- Powerful model
- Helped popularize modern machine learning thanks to its strong performance (since dethroned by deep learning)
- Robust to outliers
- Uses the kernel trick

**Cons**:
- Many possible settings
- Slow to train
- Scale matters
- Can be a black box (it's hard to understand how or why it makes predictions)
- Does not provide predicted probabilities by default (`SVC` needs `probability=True`, which slows training further)
    
**What do we know?**

- It needs training data so it's a supervised model
- We're using it for classification

**What do we not know?**
- How does it make predictions?
- What is a hyperplane?

Those questions are inherently linked. A **hyperplane** is the boundary that separates the space between our classes. If we break it down further we can better understand it.

- in one dimension, a hyperplane is called a point
- in two dimensions, it is a line
- in three dimensions, it is a plane
- in more dimensions, you can call it a hyperplane

As we start to understand that concept, it raises another question: I can draw a lot of lines, so which is the right one? That's the goal of the SVM: **finding the optimal hyperplane**. Finding this optimal hyperplane depends on a few particular vectors that support its placement, the support vectors.

Support vectors are the data points closest to the hyperplane or decision line. These are the data points that are the **most difficult** to classify. Given their proximity to the hyperplane, they have a direct bearing on its optimum location. Essentially, support vectors are the critical elements of the training set: moving or removing one would change the position of the hyperplane.
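
A fitted `SVC` exposes its support vectors, so we can look at them directly. The sketch below is illustrative only: it uses a linear kernel on the same three iris features, and `svc_demo` is a placeholder name rather than a model from this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris_demo = load_iris()
X_demo, y_demo = iris_demo.data[:, :3], iris_demo.target

svc_demo = SVC(kernel="linear").fit(X_demo, y_demo)

# How many support vectors each class contributes
print("support vectors per class:", svc_demo.n_support_)
# Row positions of the support vectors in the training data
print("first few support vector indices:", svc_demo.support_[:10])
# The support vectors themselves, one row per vector
print("support vector array shape:", svc_demo.support_vectors_.shape)
```

Only a subset of the 150 rows ends up as support vectors; the remaining points sit far enough from the boundary that they do not influence where it is placed.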

![image.png](./assets/SupportVector.png)


**Now to optimize our hyperplane!**

Step 1 - Place the separating hyperplane as far as you can from the data
![Optimal Hyperplane](./assets/optimal-hyperplane.png)

Step 2 - Find the hyperplane with the largest margin

For any hyperplane we can compute the margin.
 - Find the distance between the hyperplane and the closest data point. 
 - Take that distance and double it
 
Now you have the **margin**. Basically the margin is an area where you will not find any data points. (Note: this can cause some problems when data is noisy)

![Margin](./assets/margin.png)
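
The two-step margin calculation can be sketched numerically. This example assumes two linearly separable iris classes (setosa and versicolor) and two features, and uses a very large `C` to approximate a hard margin; these choices are illustrative and are not used elsewhere in the notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris_demo = load_iris()
# Two linearly separable classes, described by sepal length and petal length
mask = iris_demo.target < 2
X_demo = iris_demo.data[mask][:, [0, 2]]
y_demo = iris_demo.target[mask]

# A very large C approximates a hard margin (no points allowed inside it)
svc_demo = SVC(kernel="linear", C=1e6).fit(X_demo, y_demo)
w = svc_demo.coef_[0]
b = svc_demo.intercept_[0]

# Distance from every point to the hyperplane w.x + b = 0
distances = np.abs(X_demo @ w + b) / np.linalg.norm(w)

closest = distances.min()
margin_from_points = 2 * closest
margin_from_weights = 2 / np.linalg.norm(w)
print("distance to the closest point:", closest)
print("margin (double that distance):", margin_from_points)
print("margin from the weights, 2/||w||:", margin_from_weights)
```

Both routes give approximately the same number, because the fitted hyperplane is scaled so that the closest points satisfy |w.x + b| = 1.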



### Kernel Trick: For when our data isn't already linearly separable

The picture below shows the true magic behind a Support Vector Machine. The objects on the left are mapped as we'd originally find them. A full separation would require a curve, and thus more complexity than drawing a line. In a support vector machine we rearrange the data using a set of mathematical functions known as **kernels**. An intuitive way to think about a kernel is as a similarity function: given two objects, the kernel outputs a similarity score. The simplest example is the linear kernel, or dot product. Given two vectors, the similarity is proportional to the length of the projection of one vector onto the other.

Given a data point to classify, the decision function makes use of the kernel by comparing that data point to a number of support vectors, weighted by the learned parameters. The support vectors live in the same space as that data point and, along with the learned parameters, are found by the learning algorithm.

![Kernels](./assets/Input_Feature.gif)
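
To make "kernel as similarity score" concrete, here is a small sketch using scikit-learn's pairwise kernel helpers, followed by a toy dataset of concentric rings that a straight line cannot separate. The two vectors, the ring data and the `gamma` value are made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.metrics.pairwise import linear_kernel, rbf_kernel
from sklearn.svm import SVC

a = np.array([[1.0, 2.0]])
b = np.array([[3.0, 0.5]])

# Linear kernel is the dot product: 1*3 + 2*0.5
lin_ab = linear_kernel(a, b)[0, 0]
# RBF kernel is exp(-gamma * squared distance); a point compared with itself scores 1
rbf_aa = rbf_kernel(a, a, gamma=0.5)[0, 0]
rbf_ab = rbf_kernel(a, b, gamma=0.5)[0, 0]
print("linear(a, b):", lin_ab, "rbf(a, a):", rbf_aa, "rbf(a, b):", rbf_ab)

# Concentric rings: not linearly separable in the original two dimensions
X_rings, y_rings = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
linear_acc = SVC(kernel="linear").fit(X_rings, y_rings).score(X_rings, y_rings)
rbf_acc = SVC(kernel="rbf").fit(X_rings, y_rings).score(X_rings, y_rings)
print("training accuracy, linear kernel:", linear_acc)
print("training accuracy, rbf kernel:", rbf_acc)
```

The RBF kernel separates the rings where the linear kernel struggles, without us ever constructing the extra dimensions explicitly.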

#### Parameters

>class sklearn.svm.SVC(C=1.0, kernel='rbf', degree=3, gamma='auto', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape='ovr', random_state=None)

(This signature comes from an older scikit-learn release; since version 0.22 the default is `gamma='scale'`.)

Today we're going to adjust C. The C parameter trades off misclassification of training examples against simplicity of the decision surface. A low C makes the decision surface smooth by tolerating some misclassified training points in exchange for a wider margin, while a high C aims at classifying all training examples correctly, at the cost of a narrower margin and a more complex decision surface.

![BigC_LittleC](./assets/BigC_LittleC.png)

As we vary C (a smaller C means stronger regularization), we can visualize the impact on our predictions:
![Regularize](./assets/c_regulation.png)
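
The effect of C can also be inspected numerically. The sketch below refits an RBF `SVC` for a few values of C on the same three iris features and split settings used in this notebook; the C values themselves are arbitrary picks for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris_demo = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris_demo.data[:, :3], iris_demo.target, test_size=0.3, random_state=42)

results = {}
for C in [0.01, 1, 100]:
    model = SVC(C=C).fit(X_tr, y_tr)
    results[C] = {
        "n_support": int(model.n_support_.sum()),  # total support vectors
        "train_acc": model.score(X_tr, y_tr),
        "test_acc": model.score(X_te, y_te),
    }
    print("C =", C, results[C])
```

Comparing the support vector counts and the train and test accuracies across rows gives a feel for how strongly the model is being regularized at each setting.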

Note: 
* Here's a good source on [understanding the math](https://www.svm-tutorial.com/2014/11/svm-understanding-math-part-1/) or [The Idiot's Guide to SVM](http://web.mit.edu/6.034/wwwbob/svm-notes-long-08.pdf)
* Learn more about [Kernels](https://en.wikipedia.org/wiki/Kernel_method#Mathematics:_the_kernel_trick)

In [None]:
#import
from sklearn.svm import SVC

#instantiate
svc = SVC()

#fit
svc.fit(X_train, y_train)

#predict
y_pred = svc.predict(X_test)

#score
print(svc.score(X_test, y_test))

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

# Pipeline

In [None]:
#Logistic Regression Pipeline: scale, reduce with PCA, then classify

#import
from sklearn.pipeline import Pipeline

pipe_lr = Pipeline([("StSclr", StandardScaler()),
                   ("pca", PCA(n_components = 2)),
                   ("clf", LogisticRegression(random_state = 42))])

In [None]:
#SVM Pipeline
pipe_svm = Pipeline([("scl", StandardScaler()),
                    ("pca", PCA(n_components = 2)),
                    ("clf", SVC(random_state = 42))])

In [None]:
#DT Pipeline
pipe_dt = Pipeline([("sc", StandardScaler()),
                    ("pca", PCA(n_components = 2)),
                    ("clf", DecisionTreeClassifier(random_state = 42))])

In [None]:
#Collect the pipelines in a list, with display names keyed by list position
pipelines = [pipe_lr, pipe_svm, pipe_dt]
pipe_dict = {0: "Logistic Classifier", 1: "Support Vector Machine", 
             2: "Decision Tree"}

In [None]:
#fit the pipes through automation
for pipe in pipelines: 
    pipe.fit(X_train, y_train)

In [None]:
#compare accuracies
for idx, val in enumerate(pipelines):
    print("%s pipeline test accuracy: %.3f" % (pipe_dict[idx], val.score(X_test, y_test)))

In [None]:
#pick the pipeline with the highest test accuracy
best_acc = 0.0
best_clf = 0
best_pipe = None
for idx, val in enumerate(pipelines):
    acc = val.score(X_test, y_test)  #score each pipeline once
    if acc > best_acc:
        best_acc = acc
        best_pipe = val
        best_clf = idx
print('Classifier with best accuracy: %s' % pipe_dict[best_clf])
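
Picking a winner by test accuracy means the test set has been used for model selection. One common alternative, sketched here under the same data and pipeline settings, is to compare cross-validated accuracy on the training data and keep the test set for a final check. The helper `make_pipe` and the `candidates` dictionary are illustrative names, not part of the notebook.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

iris_demo = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris_demo.data[:, :3], iris_demo.target, test_size=0.3, random_state=42)

def make_pipe(clf):
    """Scale, reduce to two components, then classify (hypothetical helper)."""
    return Pipeline([("scl", StandardScaler()),
                     ("pca", PCA(n_components=2)),
                     ("clf", clf)])

candidates = {
    "Logistic Classifier": make_pipe(LogisticRegression(random_state=42)),
    "Support Vector Machine": make_pipe(SVC(random_state=42)),
    "Decision Tree": make_pipe(DecisionTreeClassifier(random_state=42)),
}

# Mean 5-fold accuracy on the training data only
cv_scores = {name: cross_val_score(pipe, X_tr, y_tr, cv=5).mean()
             for name, pipe in candidates.items()}
best_name = max(cv_scores, key=cv_scores.get)

for name, score in cv_scores.items():
    print("%s pipeline CV accuracy: %.3f" % (name, score))
print("Classifier with best CV accuracy: %s" % best_name)
```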


# Putting it all together - DT

In [None]:
SEED = 42

pipe = Pipeline([("scl", StandardScaler()),
                ("pca", PCA(n_components = 2)),
                ("clf", DecisionTreeClassifier(random_state = SEED))])

In [None]:
#Set param range
param_range = list(range(2, 15, 1))
#keys use the "<step name>__<parameter>" form so the grid reaches the "clf" step
grid_params = [{'clf__criterion': ['gini', 'entropy'],
        'clf__min_samples_leaf': [1], #originally set to param_range but takes much longer to run
        'clf__max_depth': param_range,
        'clf__min_samples_split': param_range[1:]}]
#'clf__presort' was dropped: presort was removed from DecisionTreeClassifier in scikit-learn 0.24

In [None]:
from sklearn.model_selection import GridSearchCV
gs = GridSearchCV(estimator = pipe, 
                 param_grid = grid_params,
                 scoring = "accuracy",
                 cv = 5)

In [None]:
model = DecisionTreeClassifier(random_state=SEED)
model.get_params().keys()
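
Inside a pipeline, those same parameter names are reached through the step name plus a double underscore. The sketch below lists the `clf__` parameters of a pipeline like the one above and counts the combinations in a comparable grid. `pipe_demo` and `grid_demo` are illustrative copies, and the grid leaves out `presort`, which recent scikit-learn releases no longer accept.

```python
from sklearn.decomposition import PCA
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

pipe_demo = Pipeline([("scl", StandardScaler()),
                      ("pca", PCA(n_components=2)),
                      ("clf", DecisionTreeClassifier(random_state=42))])

# Parameters of the "clf" step, as GridSearchCV expects to see them
clf_params = sorted(k for k in pipe_demo.get_params() if k.startswith("clf__"))
print(clf_params)

grid_demo = {"clf__criterion": ["gini", "entropy"],
             "clf__min_samples_leaf": [1],
             "clf__max_depth": list(range(2, 15)),
             "clf__min_samples_split": list(range(3, 15))}

# Every combination is fit once per cross-validation fold
n_candidates = len(ParameterGrid(grid_demo))
print(n_candidates, "combinations,", n_candidates * 5, "fits with cv = 5")
```

Counting the combinations before calling `fit` is a quick way to estimate how long a grid search will take.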

In [None]:
gs.fit(X_train, y_train)


#url = "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
#webbrowser.open(url,new=1)

In [None]:
print("Best params:", gs.best_params_)

In [None]:
gs.best_estimator_
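
Once the search has finished, the cross-validated score of the winning combination and its accuracy on the held-out test set are worth comparing. This is a self-contained sketch with a deliberately small grid so it runs quickly; `gs_demo` and the `max_depth` values are illustrative stand-ins for the search above.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

iris_demo = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris_demo.data[:, :3], iris_demo.target, test_size=0.3, random_state=42)

pipe_demo = Pipeline([("scl", StandardScaler()),
                      ("pca", PCA(n_components=2)),
                      ("clf", DecisionTreeClassifier(random_state=42))])

# Small grid so the example finishes in a moment
gs_demo = GridSearchCV(estimator=pipe_demo,
                       param_grid={"clf__max_depth": [2, 3, 4]},
                       scoring="accuracy",
                       cv=5)
gs_demo.fit(X_tr, y_tr)

# Mean cross-validated accuracy of the best parameter combination
print("Best CV accuracy: %.3f" % gs_demo.best_score_)
# GridSearchCV refits the best pipeline on all training data, so it can score directly
test_acc = gs_demo.score(X_te, y_te)
print("Test accuracy of best estimator: %.3f" % test_acc)
```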

# Your turn
For this part of the lab, do the following:

In [None]:
#Import the cancer dataset from sklearn, and structure as a pandas dataframe. 


In [None]:
#test the best possible model using pipe


In [None]:
#now, put the pipe into a gridsearch
