In [1]:
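# Put import statements here
from sklearn.datasets import load_iris
import numpy as np
import matplotlib.pyplot as plt
import random
random.seed(15)
np.random.seed(15)  # annihilate_data draws from np.random, so seed it too
%matplotlib inline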

# DSCI6003 Lab - SVM

Today we will be using sklearn's implementation of SVM. If you are curious, the SVM practicums contain more information about implementing your own version of SVM.

Today we will:

    1. Learn how the parameters influence the decision boundary of an SVM.
    2. Compare an SVM with Logistic Regression to see how they differ.
    3. Use SVMs to deal with imbalanced classes.

### Part 1 - Parameter Tuning

1. Load the rbf_data and rbf_labels datasets using pandas (make sure to set delim_whitespace=True and header=None).
2. Plot the data using matplotlib, setting the c attribute to the labels so the points are colored by class.
3. Fit an SVM and plot the decision boundary for each gamma in [0.1, 0.3, 1, 3, 10], keeping C constant at 1. What do you notice?
4. Fit an SVM and plot the decision boundary for each C in [1E-1, 1, 10, 100], holding gamma constant at 3. What do you notice?
5. This may take a while, but plot the decision boundary with gamma at 250. What do you notice?
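
The gamma sweep in step 3 can also be summarized numerically before plotting. The sketch below is only an illustration: it assumes a synthetic two-ring dataset from `make_circles` as a stand-in for rbf_data (the lab files are not used here), and the names `X_demo`, `y_demo`, and `gamma_results` are hypothetical. It records the training accuracy and the number of support vectors for each gamma.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Stand-in data (assumption): two noisy concentric rings, not the lab's rbf_data.
X_demo, y_demo = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=15)

gamma_results = {}
for gamma in [0.1, 0.3, 1, 3, 10]:
    # RBF-kernel SVM with C held constant at 1, as in step 3.
    clf = SVC(kernel="rbf", C=1, gamma=gamma).fit(X_demo, y_demo)
    gamma_results[gamma] = {
        "train_accuracy": clf.score(X_demo, y_demo),
        "n_support_vectors": int(clf.n_support_.sum()),
    }

for gamma, stats in gamma_results.items():
    print(f"gamma={gamma:<4} train accuracy={stats['train_accuracy']:.3f} "
          f"support vectors={stats['n_support_vectors']}")
```

Swapping the loop variable for C (with gamma fixed at 3) gives the matching summary for step 4. Training accuracy alone can hide overfitting, so the plots are still worth making.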


### Part 2 - Compare and Contrast

1. We will be using the iris dataset. Load in the iris dataset from sklearn. 
2. Make the classification binary by changing every label of 2 to a 1.
3. Plot the third and fourth columns of the dataset (watch out for indexing!). Use plt.copper() before plt.show() to change the color of the points (or use your favorite colormap).
4. Run a Logistic Regression Classifier (LRC) on the third and fourth columns and plot the boundary. What do you notice about the boundary? (Use the function below to plot the decision boundary.)
5. Now, run an SVM on the third and fourth columns, and use the function below to plot the boundary. What do you notice? Which kernel can you use to correctly classify the last point?
6. Now run steps 3 - 5 again, except this time change every 2 into a 0 rather than a 1. What do you notice about the decision boundaries?
7. To get an even better understanding of why we might prefer SVMs over LRC, load in the data_scientist.csv data. Plot it with a logistic regression and SVM decision boundary. What do you notice?
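
Steps 1, 2, 4, and 5 can be set up along the following lines. This is a minimal sketch, not the lab's reference solution: the variable names (`X_petal`, `y_binary`, `lrc`, `svm`) are hypothetical, and the linear kernel with C=1 is just one starting choice among the kernels you are asked to try.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

iris = load_iris()
# Third and fourth columns are at zero-based indices 2 and 3.
X_petal = iris.data[:, 2:4]
# Relabel class 2 as class 1 so the problem becomes binary (step 2).
y_binary = np.where(iris.target == 2, 1, iris.target)

lrc = LogisticRegression().fit(X_petal, y_binary)
svm = SVC(kernel="linear", C=1).fit(X_petal, y_binary)

print("class counts:", np.bincount(y_binary))
print("LRC training accuracy:", lrc.score(X_petal, y_binary))
print("SVM training accuracy:", svm.score(X_petal, y_binary))
print("SVM support vectors per class:", svm.n_support_)
```

Either fitted model can then be passed to the decision_boundary function as `clf`, with `X_petal` and `y_binary` as the data arguments.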

### Part 3 - Imbalanced Classes

Let's pretend this data now corresponds to credit card fraud, where a true positive means saving thousands of dollars and maintaining customer loyalty, while a false positive means calling the customer to confirm that they made the purchase (a small cost compared with letting a fraudster escape). How can you catch as many true positives (fraudsters) as possible?

1. Create variables X_small and y_small, which are subsets of the iris data. You can run the annihilate_data function below to remove most of the smaller class.
2. What do the class counts look like now? Plot the data.
3. Run an LRC and plot the decision boundary. What is the behavior of the model? 
4. Now plot the decision boundary for an SVM. What is the behavior? Change the kernels. Does anything happen? 
5. As a data scientist, you should be able to look at the documentation and figure out the best tool for the job. Looking at the SVC parameters, which one can you change to fix this problem? Plot the decision boundary after you have made this adjustment.
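
Since the goal is to catch as many fraudsters as possible, it helps to look at class counts and recall rather than accuracy alone. The sketch below is illustrative only: it assumes the labeling from step 6 of Part 2 (every 2 changed to 0, so class 1 is the minority), builds the imbalanced subset inline instead of calling annihilate_data, and uses hypothetical names (`X_small_demo`, `y_small_demo`).

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score

iris = load_iris()
X_all = iris.data[:, 2:4]
# Assumption: class 2 is relabeled as 0, leaving class 1 as the "fraud" class.
y_all = np.where(iris.target == 2, 0, iris.target)

# Keep every class-0 row and only 10 class-1 rows, sampled without replacement.
rng = np.random.default_rng(15)
minority_idx = rng.choice(np.where(y_all == 1)[0], size=10, replace=False)
keep_idx = np.append(np.where(y_all == 0)[0], minority_idx)
X_small_demo, y_small_demo = X_all[keep_idx], y_all[keep_idx]

print("class counts:", np.bincount(y_small_demo))

clf = LogisticRegression().fit(X_small_demo, y_small_demo)
y_pred = clf.predict(X_small_demo)
# Rows are true classes and columns are predicted classes.
print(confusion_matrix(y_small_demo, y_pred))
print("recall on class 1:", recall_score(y_small_demo, y_pred))
```

Recall on class 1 is the fraction of true fraud cases the model flags, which is the quantity the scenario above asks you to maximize.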


In [None]:
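def annihilate_data(X, y, num=10):
    """Keep every row of the larger class and only `num` rows of the smaller class."""
    y_0 = len(X[y == 0])
    y_1 = len(X[y == 1])
    smaller = 0 if y_0 < y_1 else 1
    # Sample without replacement so no row of the smaller class is duplicated.
    idx = np.random.choice(np.where(y == smaller)[0], size=num, replace=False)
    full_idx = np.append(np.where(y != smaller)[0], idx)
    return X[full_idx], y[full_idx]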

In [6]:
def decision_boundary(clf, X, Y, h=.02):
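    """Plot the decision regions of a fitted classifier over two features.

    Inputs:
        clf - a trained classifier, with a predict method
        X - array of shape (n_samples, 2) holding the two features to plot
        Y - array of labels, used to color the training points
        h - step size of the mesh grid
    """
    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].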
    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure(1, figsize=(4, 3))
    plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)

    # Plot also the training points
    plt.scatter(X[:, 0], X[:, 1], c=Y, edgecolors='k', cmap=plt.cm.Paired)

    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.xticks(())
    plt.yticks(())
    plt.show()