In [None]:
NAME = "" # put your full name here
COLLABORATORS = [] # list names of anyone you worked with on this homework.

# [ER 131] Homework 11: Support Vector Machines
----

This homework uses support vector machines to classify CalEnviroScreen data. We will take gradual steps, starting by recalling key information from the lectures and textbook and ending by creating our own classifiers. Along the way, we'll build intuition for perceptrons and maximal margin classifiers (MMC), then for support vector machines (SVMs), and finally apply SVMs to CalEnviroScreen data. The textbook reference here is ISLR 9.1-9.3.


### Table of Contents

[CalEnviroScreen Data](#data)<br>
1. [Perceptrons and MMC](#perceptron)<br>
1. [SVM Intuition](#svm)<br>
1. [Using SVM to Classify CalEnviroScreen Data](#classify) <br>

**Dependencies:**

In [None]:
# Import Packages
import numpy as np
import pandas as pd
from matplotlib import style
from matplotlib import pyplot as plt

# Import sample generators
# (the old `sklearn.datasets.samples_generator` module is gone from recent scikit-learn releases)
from sklearn.datasets import make_blobs, make_circles
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Color Scheme for SVM!
colormap = np.array(['blue', 'gold']) # Go bears!

# Allows us to plot SVC decision functions
def plot_svc_decision_function(model, ax=None, plot_support=True):
    """Plot the decision function for a 2D SVC
    
    Variables: 
        model: classifier
    
    Usage:
    >>> from sklearn.svm import SVC
    >>> clf = SVC(kernel='linear')
    >>> clf.fit(X, y)
    >>> plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='spring')
    >>> plot_svc_decision_function(clf) # Draw the decision boundary
    >>> plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],
            s=200, facecolors='none');
    """
    if ax is None:
        ax = plt.gca()
    xlim = ax.get_xlim()
    ylim = ax.get_ylim()
    
    # create grid to evaluate model
    x = np.linspace(xlim[0], xlim[1], 30)
    y = np.linspace(ylim[0], ylim[1], 30)
    Y, X = np.meshgrid(y, x)
    xy = np.vstack([X.ravel(), Y.ravel()]).T
    P = model.decision_function(xy).reshape(X.shape)
    
    # plot decision boundary and margins
    ax.contour(X, Y, P, colors='k',
               levels=[-1, 0, 1], alpha=0.5,
               linestyles=['--', '-', '--'])
    
    # plot support vectors
    if plot_support:
        ax.scatter(model.support_vectors_[:, 0],
                   model.support_vectors_[:, 1],
                   s=300, linewidth=1, facecolors='none');
    ax.set_xlim(xlim)
    ax.set_ylim(ylim)

In [None]:
# run this cell
# (recent pandas versions read .xlsx files with openpyxl; xlrd 2.0+ only supports .xls)
!pip install xlrd openpyxl

---

### CalEnviroScreen Dataset <a id='data'></a>
Carrying on from the previous homework, we will be using the CalEnviroScreen dataset. CalEnviroScreen 3.0 is a comprehensive record of the environmental and demographic conditions of each census tract in California. In this homework, we are interested in predicting environmental conditions using information related to demographics. 

Please note that the Excel file can be downloaded from [here](https://oehha.ca.gov/calenviroscreen/report/calenviroscreen-30). However, for this homework, the Excel file has already been placed in the same directory as the homework. 

Before we get to working with the data, however, we're going to use simulated data to develop some concepts.

---
### Section 1: Perceptrons and Maximal Margin Classification  <a id='perceptron'></a>

#### What's a Perceptron?

You'll remember from the asynchronous lectures that SVMs are a way to classify observations into one of two possible classes. SVMs are a fairly flexible method that can allow for non-linear splits between observations. Given a set of hyperparameters, the SVM delivers a unique solution. 

For the purposes of this homework, a perceptron is any hyperplane sitting between linearly separable data. In this way perceptrons are a little more generic than SVMs (any plane will do) and a little less flexible (the data need to be linearly separable).  

We use the following mathematical notation when we talk about perceptrons. We'll define our training dataset as $D$, and the number of points in $D$ as $n$, with the following notation:

$D = \{(x_i, y_i)\}_{i = 1}^{n}$

$x_i$ and $y_i$ have to meet certain criteria: $y_i$ can only be equal to -1 or 1, where -1 represents one class and 1 represents another class. This is expressed mathematically as:

$y_i \in \{-1 , +1\}$

We also specify that each $x_i$ has to be a vector of $p$ real numbers (this is not a condition that we really have to worry about in our applications of machine learning) in this way:

$x_i \in \mathbb{R}^p$ 

$p$ is the number of features we have - i.e. $X$ is an $n \times p$ matrix.
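
To make the notation concrete, here is a minimal sketch using a made-up dataset. The values and the names `X_demo`, `labels01`, and `y_demo` are hypothetical and are not part of the homework data. It shows $X$ stored as an $n \times p$ array and labels recoded from 0/1 to the $-1$/$+1$ convention.

```python
import numpy as np

# Hypothetical toy dataset: n = 4 observations, p = 2 features
X_demo = np.array([[1.0, 2.0],
                   [2.0, 1.5],
                   [6.0, 5.0],
                   [7.0, 6.5]])

# Labels stored as 0/1 ...
labels01 = np.array([0, 0, 1, 1])
# ... recoded to the -1/+1 convention used in the notation above
y_demo = 2 * labels01 - 1

n, p = X_demo.shape
print(n, p)      # number of observations, number of features
print(y_demo)
```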

A perceptron is a $(p - 1)$-dimensional hyperplane that perfectly separates the $+1$ and $-1$ classes. A hyperplane is defined as a "flat subspace of a $p$-dimensional space." 

Let's think about what this means intuitively, by considering different values of $p$. If $p = 1$, each observation is described by a single predictor, and we're trying to divide the observations into two categories. In this case, a $(p - 1)$-dimensional hyperplane is a 0-dimensional hyperplane - which in our case is just a single point! You can think of your $x$ values as falling along a number line, and your division being a single point on that line. If $p = 2$, each observation is described by two predictors, so we want a 1-dimensional hyperplane - this is a line. In this case, you can think of plotting your first predictor $x_1$ on an x-axis, and $x_2$ on a y-axis; a line can be drawn to separate observations.
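
The $p = 1$ and $p = 2$ cases above can be sketched in code. This sketch assumes the standard way of writing a hyperplane, $w \cdot x + b = 0$, and assigns a class by the sign of $w \cdot x + b$. The helper `side_of_hyperplane`, the weights, and the data points are all hypothetical choices made for illustration.

```python
import numpy as np

def side_of_hyperplane(X, w, b):
    """Return +1 or -1 for each row of X depending on the sign of w.x + b."""
    return np.where(X @ w + b >= 0, 1, -1)

# p = 1: the "hyperplane" is the single point x = 3 (w = 1, b = -3)
X1 = np.array([[1.0], [2.5], [3.5], [6.0]])
side_1d = side_of_hyperplane(X1, np.array([1.0]), -3.0)
print(side_1d)

# p = 2: the hyperplane is the line x1 + x2 - 5 = 0
X2 = np.array([[1.0, 1.0], [2.0, 2.0], [4.0, 3.0], [5.0, 5.0]])
side_2d = side_of_hyperplane(X2, np.array([1.0, 1.0]), -5.0)
print(side_2d)
```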

**Question 1.1** If we have 3 predictors, what is the shape and dimension of a hyperplane that divides our observations into two classes? How would you plot this hyperplane?

*Your answer here*

In the next question, we are going to use `make_blobs` and `make_circles` extensively. These are sample generators provided by the `scikit-learn` package, which--as their names imply--will allow us to randomly generate blobs and circles.

The following cell is an example of how we might call `make_blobs`.

In [None]:
# Generating blobs of data with 2 centers
X, y = make_blobs(n_samples=100, centers=2, cluster_std=0.50, center_box=(-4, 4), random_state = 2020)

# Plotting the blobs of data
plt.title("Data of Two Randomly Generated Clusters")
plt.scatter(X[:, 0], X[:, 1], c=colormap[y], s=50)

**Question 1.2:** There are many hyperplanes that can separate these two clusters of data. Give 3 possible examples in the code below based on visually inspecting the plot above. We've started defining three tuples for you (`first`, `second`, and `third`), where the first item in each tuple is the slope and second item should be the y-intercept. In other words, `first/second/third = (slope, y-intercept)`.

(*Side note:* We talked about tuples at the start of the semester -- but here's a reminder: They're kind of like lists, in that they contain a sequence of values through which you can iterate, but unlike lists the values they contain can't be modified easily after they're initialized. Tuples are contained in round brackets (`()`) while lists are contained in square brackets (`[]`)).
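
As a quick refresher on the difference, here is a small sketch. The `(slope, y-intercept)` values are arbitrary placeholders and are not meant as an answer to Question 1.2.

```python
# A tuple holding a hypothetical (slope, y-intercept) pair
line = (2.0, -1.0)
slope, intercept = line      # tuples can be unpacked just like lists

as_list = [2.0, -1.0]
as_list[0] = 3.0             # lists can be modified in place

try:
    line[0] = 3.0            # tuples cannot: this raises a TypeError
    modified = True
except TypeError:
    modified = False

print(slope, intercept, as_list, modified)
```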

In [None]:
# Define example hyperplanes here
first = (..., ...)
second = (..., ...)
third = (..., ...)

Run the following cell to double-check that your answers are right and reasonable.

In [None]:
# Plotting 
xfit = np.linspace(X[:,0].min(), X[:,0].max())
plt.scatter(X[:, 0], X[:, 1], c=colormap[y], s=50)

for m, c in [first, second, third]:
    plt.plot(xfit, m * xfit + c, '-k')

plt.title("Two Randomly Generated Clusters and Hyperplanes Separating Them")
plt.xlim(xfit.min()-0.5, xfit.max()+0.5)

**Question 1.3:** There are multiple answers (we definitely know there are at least 3!) to Question 1.2. Explain why. 

*Your answer here*

**Question 1.4:** The issue explored above - i.e., that a dataset can allow for multiple hyperplanes that divide it - leads us to use maximal margin classification (MMC), or hard-margin SVM, in practice. In a few sentences, explain what differentiates MMC from the perceptrons that we've explored above.

*Your answer here*

We are now going to examine how to code Perceptrons, especially since this process is very similar to that used to code SVMs (which we will see later in the homework). First, we import the necessary library: 

In [None]:
from sklearn.linear_model import Perceptron

Next, we'll create our artificial dataset, which has been hard-coded below. Later, we will use `scikit-learn`'s sample generators to create our datasets. But, for now, let's try to develop a better understanding of the kind of data that we use to classify information.

In [None]:
x = np.array([
[2, 1, 2, 5, 8, 2, 3, 6, 1, 2, 5, 5, 5, 5, 6],
[2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 4, 7, 5, 7, 3]
])

y = np.array([0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1])

**Question 1.5:** Qualitatively describe what `x` and `y` represent in the above block of code. What will the coordinate $(1, 3)$ be classified as?

*Hint*: `x` and `y` above function similarly to the `X` and `y` that were assigned as the output of `make_blobs()` a few questions ago!

*Your answer here*

Now that we have developed an understanding of what our data is, let's plot it.

In [None]:
plt.title("Data colored by class")
plt.scatter(x[0], x[1], c=colormap[y], s=40)

**Question 1.6:** Based on the graph, is this dataset linearly separable? Why or why not?

*Your answer here*

Now, we are going to try building code that finds perceptrons, using a modified dataset defined below:

In [None]:
data = np.array([
[2, 1, 2, 5, 7, 2, 3, 6, 1, 2, 5, 4, 6, 5],
[2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 7]
])

label = np.array([0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1])
plt.title("Data colored by class")
plt.scatter(data[0], data[1], c=colormap[label], s=40)

**Question 1.7:** In the following block, set up our classifier, using `Perceptron`. Look at the following [link](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html) for more information on setting up a classifier with Perceptron. We want the model to have the following parameters:
1. `max_iter` = 100
2. `verbose` = 0
3. `eta0` = 0.001

In [None]:
# # YOUR CODE HERE
classifier = ...

**Question 1.8** In the following block, fit the Perceptron classifier you set up in Question 1.7. Again, please refer to the following [link](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html) for more information on using the classifier. 

Before we run `.fit()`, take a look at the above documentation for `Perceptron()`. Which variables will we use as the `X` and `y` inputs? Are their dimensions correct? If not, make any adjustments needed and then run `classifier.fit()`.

In [None]:
# # YOUR CODE/ANSWER HERE
...
classifier.fit(...)

Finally, let's run the following cell to see what kind of decision boundary we have come up with!

In [None]:
# Plot the original data
plt.scatter(data[0], data[1], c=colormap[label], s=40)

# Calc the hyperplane (decision boundary)
xmin, xmax = plt.xlim()

w = classifier.coef_[0]
a = -w[0] / w[1]
xx = np.linspace(xmin, xmax)
yy = a * xx - (classifier.intercept_[0]) / w[1]

# Plot the line
plt.plot(xx, yy, 'k-')
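
The slope and intercept used in the plotting cell above come from rearranging the boundary equation $w_0 x_1 + w_1 x_2 + b = 0$ into $x_2 = -(w_0 / w_1) x_1 - b / w_1$. The sketch below checks that rearrangement numerically. The values of `w` and `b` are hypothetical stand-ins for `classifier.coef_[0]` and `classifier.intercept_[0]`, not output from a fitted model.

```python
import numpy as np

# Hypothetical weights and intercept, standing in for
# classifier.coef_[0] and classifier.intercept_[0]
w = np.array([2.0, -4.0])
b = 6.0

# Boundary: w[0]*x1 + w[1]*x2 + b = 0  =>  x2 = -(w[0]/w[1])*x1 - b/w[1]
slope = -w[0] / w[1]
intercept = -b / w[1]

x1 = np.linspace(0, 5, 6)
x2 = slope * x1 + intercept

# Every point on the line should give a decision value of zero
decision_values = w[0] * x1 + w[1] * x2 + b
on_boundary = np.allclose(decision_values, 0.0)
print(slope, intercept, on_boundary)
```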

**Question 1.9** In this question, repeat the steps in questions 1.7-1.8, but this time pass a value other than the default to the arguments `penalty` and `n_iter_no_change` (read the docstring or documentation above to figure out possible arguments for `penalty`. Pass a value greater than 5 to `n_iter_no_change`). How does the decision boundary change? What is changing about our implementation of the algorithm when we modify these parameters?

In [None]:
# YOUR CODE HERE

*Your answer here*

Note that the Perceptron algorithm implemented here is satisfied when it finds *any* hyperplane that separates the data. You can probably see that other hyperplanes would leave more "room" between the data and the hyperplane. That's where maximal margin algorithms (which underlie SVMs) come in.
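
One way to put a number on that "room" is the margin: the smallest distance from any training point to the hyperplane. The sketch below compares two separating lines on a toy dataset. It assumes labels coded as $-1$/$+1$ and a hyperplane written as $w \cdot x + b = 0$. The helper `margin`, the data, and both candidate lines are hypothetical.

```python
import numpy as np

def margin(X, y, w, b):
    """Smallest signed distance from the points to the hyperplane w.x + b = 0.

    The result is positive only if every point lies on the side
    matching its label (-1 or +1).
    """
    return np.min(y * (X @ w + b) / np.linalg.norm(w))

# Hypothetical linearly separable toy data with labels in {-1, +1}
X_toy = np.array([[1.0, 1.0], [2.0, 1.0], [4.0, 4.0], [5.0, 4.0]])
y_toy = np.array([-1, -1, 1, 1])

# Two hypothetical separating lines: both classify every point correctly
m_a = margin(X_toy, y_toy, np.array([1.0, 1.0]), -5.5)   # line x1 + x2 = 5.5
m_b = margin(X_toy, y_toy, np.array([1.0, 0.0]), -2.5)   # line x1 = 2.5

print(m_a, m_b)
```

Both lines separate the classes, so a perceptron could return either one, but the first leaves more room than the second.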

---
### Section 2: SVM Intuition <a id='svm'></a>

Before we start classifying the CalEnviroScreen Dataset, let's review the intuition behind using Support Vector Machines. 

<img src="svd.png" width="400">

This is an example of an artificially created dataset to be classified, where red $+$ data points and green $\large{\circ}$ data points represent two different classes. In the following questions, assume the following: 
1. Training data comes from error-prone sensors, so we're not that confident in the location of any one point,
2. We are training our SVM using a __quadratic__ kernel. The hyperparameter $C$ will determine the location of the separating hyperplane.

Answer the following questions with a one line justification.

**Question 2.1** Given the potential decision boundaries below, which one has a large $C$ (i.e. $C \to \infty$) and which one has a small $C$ (i.e. $C \to 0$)?

<img src="svd2.png" width="400">

*Your answer here*

**Question 2.2** In this particular dataset, why might it be more advantageous to use a large value of $C$ than a small one?

*Your answer here*

**Question 2.3** Imagine you've received one additional data point in the green circle class. Name one coordinate this data point could have that will **not change** the decision boundary for small values of C. Justify your answer.

*Your answer here*

**Question 2.4** Name one coordinate the new data point could have that **will** change the decision boundary learned for small values of $C$. Justify your answer.

*Your answer here*

Before we begin the next set of questions, run the following cell to import the module we need to run SVM using `scikit-learn`. We will also use scikit-learn's sample generators (`make_blobs` and `make_circles`), which were imported at the top of the notebook.

In [None]:
from sklearn.svm import SVC

In this question, we are going to use `make_blobs` again. Run the cell below to generate another set of random blobs.

Keep this code in mind as you will later be asked to call `make_circles`. 

In [None]:
# Generating blobs of data with 2 centers
X, y = make_blobs(n_samples=50, centers=2, cluster_std=0.60, center_box=(-4, 4), random_state = 11)

# Plotting the blobs of data
plt.title("Data of Two Randomly Generated Clusters")
plt.scatter(X[:, 0], X[:, 1], c=colormap[y], s=50)

Let's now use support vector machines to classify our data. 

**Question 2.5** Make the Support Vector Machine Classifier in the following cell with a linear kernel.

*Hint:* Use `SVC` to create your model and refer to this [link](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC) for more information. **Important note:** Pay close attention to the definition of the parameter `C` as applied in scikit-learn! It's not quite the same as the hyperparameter $C$. 

Then, fit your support vector machine on the data. Remember, the data for blobs is `X` and labels are `y`.

In [None]:
# # YOUR CODE HERE
clf = ...

Now, let's have a look at what we have made by running the following cell.

In [None]:
# Graphing SVM and Data
plt.scatter(X[:, 0], X[:, 1], c=colormap[y], s=50)
plot_svc_decision_function(clf)

plt.title("SVM on the Blobs of Data")

So far, we have only really dealt with linearly separable data. What happens if the data is not linearly separable? Let's examine these cases in the following questions. 

**Question 2.6** Just as we have created our dataset for blobs of data, make a new dataset for circles below. Plot that data, using different colors for the different classes of data.

*Hint: Refer to the code above and use the function [`make_circles`](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_circles.html).*

In [None]:
# # YOUR CODE HERE
X, y = make_circles(100, factor=.1, noise=.2)

# # plot data

**Question 2.7**: Train an SVM with a linear kernel on this new dataset and plot the points along with the decision boundary, using the function `plot_svc_decision_function()`.

In [None]:
# YOUR CODE HERE

The linear kernel doesn't really do a good job of distinguishing these two classes. Let's try a radial basis kernel, called `"rbf"` in scikit-learn. 
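
For some intuition about why a non-linear kernel can help here, the sketch below builds two hypothetical, noise-free rings by hand and adds one extra feature, the squared distance from the origin. This is only an illustration of the general idea of mapping data into a space where it becomes separable. It is not what the radial basis kernel computes internally, and the radii and threshold are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical noise-free rings: inner class at radius 0.2, outer at radius 1.0
angles = rng.uniform(0, 2 * np.pi, size=100)
radii = np.where(np.arange(100) < 50, 0.2, 1.0)
X_ring = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
y_ring = (radii == 1.0).astype(int)   # 0 = inner ring, 1 = outer ring

# Extra feature: squared distance from the origin
r2 = (X_ring ** 2).sum(axis=1)

# In this new feature, a single threshold separates the two classes
threshold = 0.5
pred = (r2 > threshold).astype(int)
print((pred == y_ring).mean())
```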

**Question 2.8** Repeat the steps in question 2.7, but with a radial basis kernel.

In [None]:
# YOUR CODE HERE


---
### Section 3: Using SVM to Classify CalEnviroScreen Data <a id='classify'></a>

<br>
Now that we've explored how SVM works and how to implement it, let's begin applying our knowledge on the CalEnviroScreen dataset.

In [None]:
# run this cell
env = pd.read_excel('ces3results.xlsx')
demog = pd.read_excel('ces3results.xlsx', sheet_name = 3, header = 1)
demog = demog.rename(columns = {'Unnamed: 0': 'Census Tract',
                          'Unnamed: 1': 'CES 3.0 Score',
                          'Unnamed: 2': 'CES 3.0 Percentile',
                          'Unnamed: 3': 'CES 3.0 Percentile Range',
                          'Unnamed: 4': 'Total Population',
                          'Unnamed: 5': 'California County'
    
})

Second, let's have a look at what each of these dataframes contains (these should look familiar from HW10).

In [None]:
env.head()

In [None]:
demog.head()

Before we get started, we'll select only the columns we want. We'll be predicting the Percentile Range based on Unemployment and PM2.5, and we'll only be looking at records with a percentile > 95% or < 5%. Run the cell below to grab just those columns and rows and remove any NAs.

In [None]:
# run this cell
env_model = env[['Unemployment', 'PM2.5','CES 3.0 \nPercentile Range']]
env_model = env_model.loc[env_model['CES 3.0 \nPercentile Range'].isin(['95-100% (highest scores)', '1-5% (lowest scores)'])]
env_model.dropna(inplace = True)
env_model.head()

**Question 3.1:** Now, given the dataframe `env_model`, create a dataframe `X` that contains the predictors and `y` that contains the response. Then, split `X` and `y` into train and test sets, using an 80/20 train/test split and a `random_state` of your choice.

In [None]:
# YOUR CODE HERE

**Question 3.2:** Let's start classifying information now. Below, as in *Question 2.7*, make an SVM classifier with a linear kernel (pick a value of $C$ of your choice! If fitting takes too long, $C$ is probably too high) and train it on the training data. 

*Hint: If you get the error,* 

>`DataConversionWarning`: A column-vector ?? was passed when a `1d` array was expected. Please change the shape of ?? to (n_samples, ), for example using `ravel()`

*use ??.values.ravel() to override this issue*
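
To see what `ravel()` does to the shape, here is a small sketch with a hypothetical column vector. The array `column_vector` stands in for what `.values` returns on a one-column DataFrame.

```python
import numpy as np

# Hypothetical (n_samples, 1) column vector of labels
column_vector = np.array([["high"], ["low"], ["high"]])
flat = column_vector.ravel()

print(column_vector.shape)   # two-dimensional: one column
print(flat.shape)            # one-dimensional: (n_samples,)
```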

In [None]:
# YOUR CODE HERE

**Question 3.3:** Use the classifier to predict the outcome of our `X_test` and save the output to `y_pred`.

In [None]:
# YOUR CODE HERE

Congratulations! You have finished training a classifier on a dataset. Run the cell below to see if you are getting a reasonable percentage of correct matches.

In [None]:
from sklearn.metrics import classification_report, confusion_matrix  
print(confusion_matrix(y_test,y_pred))  
print(classification_report(y_test,y_pred))  

**Question 3.4** Interpret the values in the confusion matrix. What does each of the four values mean? The [documentation for `confusion_matrix()`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) is a good reference.
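
As a warm-up, the sketch below tallies a confusion matrix by hand for six made-up observations, following the convention in the linked documentation that rows index the true class and columns index the predicted class. The labels and predictions are hypothetical and unrelated to the CalEnviroScreen results.

```python
# Hypothetical true and predicted labels for six observations
y_true = ["low", "low", "low", "high", "high", "high"]
y_hat = ["low", "low", "high", "high", "high", "low"]

classes = sorted(set(y_true))   # class order used for both rows and columns

# Rows index the true class, columns index the predicted class
matrix = [[sum(t == r and p == c for t, p in zip(y_true, y_hat))
           for c in classes]
          for r in classes]

for r, row in zip(classes, matrix):
    print(r, row)
```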

*Your answer here*

**Question 3.5** Based on the classification report, are there more false positives for the "lowest score" class or are there more false negatives? Which values in the classification report give you this information?

For reference, see the documentation for [`classification_report()`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html); scrolling down to the "Returns" section and following relevant links there will also be helpful.

*Your answer here*

Run the cell below to see what our classification looks like! Note that this code assumes that `y` is a Pandas series, `X` is a Pandas dataframe, and the classifier you trained in Question 3.2 is named `svclassifier`; if you set up your variables differently you may have to make slight modifications where those variables are called.

In [None]:
# run this cell
plt.figure(figsize = (10,7))

classes = pd.unique(y)
y_label = np.array([np.where(classes == i)[0][0] for i in y.values])

labels = [None]*len(y_label)
for i in classes:
    labels[np.where(y == i)[0][0]] = i

for i in range(len(X)):
    plt.scatter(X.iloc[i, 0], X.iloc[i, 1], c=colormap[y_label[i]], s=20, label = labels[i])

plot_svc_decision_function(svclassifier)

plt.xlabel("unemployment")
plt.ylabel("PM2.5")
plt.title("SVM on Cal Enviroscreen Data")

plt.legend()

---
## Submission

Congrats, you finished the final homework!

Before you submit, click **Kernel** -> **Restart & Clear Output**. Then, click **Cell** -> **Run All**. Then, go to the toolbar and click **File** -> **Download as** -> **.html** and submit the file through bCourses.

---

## Bibliography

Carnegie Mellon University's Machine Learning Course (10 - 701) - Images on Question 2 - http://www.cs.cmu.edu/~ninamf/courses/601sp15/lectures.shtml

Jayanta Basak, *A Least Square Kernel Machine with Box Constraints* - Inspiration and Formula for Question 4.4 - http://www.jprr.org/index.php/jprr/article/viewfile/181/57

Jake VanderPlas - Function for Drawing SVC Decision Boundaries - *Python Data Science Handbook*

---
Notebook developed by: Beom Jin Lee (Brian)

Data Science Modules: http://data.berkeley.edu/education/modules
