Dylan Miskowski

# CMSE 202 Homework 4 (Individual)

## Using SVM and PCA to predict the outcome of chess games

### Goals for this homework assignment

By the end of this assignment, you should be able to:

* Use `git` to track your work and turn in your assignment
* Read and impute data to prepare it for modeling
* Build, fit, and evaluate an SVC model of data
* Use PCA to reduce the number of important features
* Build, fit, and evaluate an SVC model of PCA-transformed data
* Systematically investigate the effects of the number of components on an SVC model of data


### Assignment instructions:

Work through the following assignment, making sure to follow all of the directions and answer all of the questions.

There are 25 points possible on this assignment. Point values for each part are included in the section headers.

This assignment is due at 11:59 pm on Friday, November 13th. It should be pushed to your repo (See Part 1). 

In [None]:
## Our imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.svm import SVC
from sklearn.decomposition import PCA

---
## 1. Adding notebook to your turn-in repository

Like you did for Homework 3, you're going to add this notebook to the CMSE202 repository you created in class so that you can track your progress on the assignment and preserve the final version that you turn in. In order to do this you need to:

* Navigate to your /CMSE202/repos repository and create a new directory called hw-04.
* Move this notebook into that new directory in your repository, then add it and commit it to your repository.
* Finally, to test that everything is working, `git push` the file so that it ends up in your GitHub repository.

Important: Make sure you've added your TA as a collaborator to your repository with "Read" access so that we can see your assignment. (*If you did this for Homework 3, you do not need to do it again*)

* Section 001: tuethan
* Section 002: Luis-Polanco
* Section 003: DavidRimel

Also important: Make sure that the version of this notebook that you are working on is the same one that you just added to your repository! If you are working on a different copy of the notebook, none of your changes will be tracked.

If everything went as intended, the file should now show up on your GitHub account CMSE202 repository under the hw-04 directory that you just created. Periodically, you'll be asked to commit your changes to the repository and push them to the remote GitHub location. Of course, you can always commit your changes more often than that, if you wish. It can be good to get into a habit of committing your changes any time you make a significant modification, or when you stop working on the project for a bit.

---
## 2. Chess Game Data

The data you will work with are configurations of a chess end game. Each configuration assumes that a pawn is one move away from ["queening"](https://en.wikipedia.org/wiki/Promotion_(chess)) and the other pieces can be moved to perform different offensive or defensive actions. Each of the 36 features can take one of several values (the entries in a given column). **The details of the data matter a bit less for our purposes, but we are attempting to predict a win or loss by a given side.** If you really want to know about the data, you can look into a [classic text on Artificial Intelligence by Shapiro](https://www.amazon.com/Encyclopedia-Artificial-Intelligence-Stuart-Shapiro/dp/0471807486).

You will first do this with a full model, then investigate how well the model works after a PCA has been done on the data.

### 2.1 Read in the data

First you need to read in the data from `kr-vs-kp.data`. You can look at `kr-vs-kp.names` to see how the data is structured. But we give you the code for the column naming as there are so many features and they are unlabeled in the `.data` file.

```
cols = ["bkblk","bknwy","bkon8","bkona","bkspr","bkxbq","bkxcr","bkxwp","blxwp","bxqsq","cntxt","dsopp","dwipd",
        "hdchk","katri","mulch","qxmsq","r2ar8","reskd","reskr","rimmx","rkxwp","rxmsq","simpl","skach","skewr",
        "skrxp","spcop","stlmt","thrsk","wkcti","wkna8","wknck","wkovl","wkpos","wtoeg","won"]
```
 
<font size=8 color="#009600">&#9998;</font> Do this - Read in the data from `kr-vs-kp.data` using the columns listed above. Print the `.head()` of the dataframe.

In [None]:
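# kr-vs-kp.data has no header row, so pass the column names explicitly
# (`cols` is the list given in the instructions above and must be defined first)
chess = pd.read_csv('kr-vs-kp.data', header=None, names=cols)
chess.head()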
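
Why the column names matter when reading the file can be seen in a minimal sketch. The two rows and the short column list below are made up for illustration (they are assumptions, not the real 37-column data), and `io.StringIO` stands in for the file on disk.

```python
import io
import pandas as pd

# Two made-up rows in the same headerless, comma-separated layout
# (the real file has 36 feature columns plus the outcome).
raw = "f,f,t,won\nt,f,f,nowin\n"

# Hypothetical short column list standing in for `cols`.
demo_cols = ["feat_a", "feat_b", "feat_c", "won"]

# Without header=None / names=..., the first data row is consumed as the header.
lost_row = pd.read_csv(io.StringIO(raw))
# With explicit names, every row is kept as data.
kept_rows = pd.read_csv(io.StringIO(raw), header=None, names=demo_cols)

print(lost_row.shape)    # one data row fewer
print(kept_rows.shape)
print(list(kept_rows.columns))
```

Comparing the two shapes is a quick sanity check that no game was silently turned into column labels.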

### 2.2 Imputing the data

There are no missing data in this data file, but there are some other issues. 

When you print the head of this data set, you probably noticed that all the features and labels are strings. We need to replace them with numerical values for modeling. For the `won` column replace winning with a 1 and losing with a 0. For the other columns, there are seven strings. Replace them using the following table:

| raw data | replaced |
| -------- | -------- |
| f | 1 |
| l | 2 |
| n | 3 |
| t | 4 |
| w | 5 |
| b | 6 |
| g | 7 |

**Note:** this choice of mapping really matters: for the models we have learned, it can strongly influence the results. We do this because the model needs numerical input, but we haven't critically thought about which mapping makes the most sense. There are other models (e.g., [tree-based algorithms](https://en.wikipedia.org/wiki/Random_forest)) that can handle these categorical data without this mapping.
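
To see why the mapping matters, here is a small sketch comparing the integer codes from the table with one-hot coding. One-hot coding is not part of this assignment; it is shown only as an assumed alternative for contrast. With integer codes, the numerical gap between two categories depends on the arbitrary order of the table, while one-hot coding keeps all categories equally far apart.

```python
import numpy as np

categories = ["f", "l", "n", "t", "w", "b", "g"]
codes = {c: i + 1 for i, c in enumerate(categories)}  # f->1 ... g->7, as in the table

# Integer coding: the "distance" between categories depends on the table order.
dist_f_l = abs(codes["f"] - codes["l"])
dist_f_g = abs(codes["f"] - codes["g"])

# One-hot coding (assumed alternative): each category gets its own 0/1 column.
one_hot = np.eye(len(categories))
idx = {c: i for i, c in enumerate(categories)}
oh_f_l = np.linalg.norm(one_hot[idx["f"]] - one_hot[idx["l"]])
oh_f_g = np.linalg.norm(one_hot[idx["f"]] - one_hot[idx["g"]])

print(dist_f_l, dist_f_g)  # unequal gaps under integer coding
print(oh_f_l, oh_f_g)      # equal gaps under one-hot coding
```

A distance-based model such as an SVC sees those gaps directly, which is one reason the choice of codes can change the fit.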

<font size=8 color="#009600">&#9998;</font> Do this - Replace the entries in the columns as indicated above. Print the `.head()` of the dataframe to show you have successfully done so.

In [None]:
#replace the string entries with the numeric codes from the table above
chess = chess.replace({'f': 1, 'l': 2, 'n': 3, 't': 4, 'w': 5, 'b': 6, 'g': 7,
                       'won': 1, 'nowin': 0})
chess.head()

### 2.3 Separate features and class labels

As we have seen in our analyses using `sklearn` it is advantageous to separate our dataframes into `features` and `labels` for the analysis we are intending to do.

<font size=8 color="#009600">&#9998;</font> Do this - Separate the data frame into two: a features dataframe and a labels dataframe.

In [None]:
chess_labels=chess['won']
chess_features=chess.drop('won', axis=1)


**Question:** How balanced is your outcome variable? Why does it matter for the outcome to be balanced?

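<font size=8 color="#009600">&#9998;</font> The outcome looks fairly balanced, with wins and losses each making up roughly half of the rows (`chess_labels.value_counts()` shows the exact counts). Balance matters because with a heavily imbalanced outcome a model could reach a high accuracy just by always predicting the majority class, so accuracy alone would be misleading.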
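
The effect of balance on accuracy can be sketched with a "model" that always predicts the most common class. The label lists below are made up for illustration and are not the chess outcomes.

```python
from collections import Counter

# Made-up labels for illustration only (1 = won, 0 = did not win).
balanced = [1, 0, 1, 0, 1, 0, 1, 1, 0, 0]
imbalanced = [1] * 9 + [0]

def majority_baseline_accuracy(labels):
    """Accuracy of a model that always predicts the most common class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

print(majority_baseline_accuracy(balanced))
print(majority_baseline_accuracy(imbalanced))
```

On the balanced labels the do-nothing baseline is no better than a coin flip, so a high accuracy there reflects real predictive skill. On the imbalanced labels the same baseline already looks good.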

---
## 3. Building an SVC model

For this classification problem, we will use a support vector machine. As you learned in the midterm review, we could easily replace this with any `sklearn` classifier we choose. We will use a linear kernel.

### 3.1 Splitting the data

<font size=8 color="#009600">&#9998;</font> Do this - Split your data into a training and testing set with a train size representing 75% of your data. Print the lengths to show you have the right number of entries.

In [None]:
train_vectors, test_vectors, train_labels, test_labels = train_test_split(chess_features, chess_labels, train_size=0.75)
print(len(train_vectors), len(test_vectors), len(train_labels), len(test_labels))
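
As a hedged variation on the split, the sketch below adds two optional `train_test_split` arguments that the assignment does not require: `random_state` makes the split reproducible, and `stratify` keeps the class proportions the same in both parts. The arrays are synthetic stand-ins, not the chess data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the feature table and the labels (not the chess data).
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 40 + [1] * 60)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, stratify=y, random_state=0
)

print(len(X_train), len(X_test))       # 75% / 25% of the rows
print(y_train.mean(), y_test.mean())   # same share of class 1 in both parts
```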


### 3.2 Modeling the data and evaluating the fit

As you have done this a number of times, we ask you to do most of the analysis in one cell.

<font size=8 color="#009600">&#9998;</font> Do this - Build a linear SVC model (`C=100`), fit it to the training set, use the test features to predict the outcomes. Evaluate the fit using the confusion matrix and classification report.

 **Note:** You should look at the documentation on the confusion matrix because the way `sklearn` outputs false positives and false negatives is different from what most images on the web indicate.
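
The layout the note refers to can be checked on a tiny example. In `sklearn`, entry `[i, j]` of the confusion matrix counts samples whose true label is `i` and whose predicted label is `j`, so for binary labels `ravel()` yields the counts in the order TN, FP, FN, TP. The labels below are made up for illustration.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 1 = positive class, 0 = negative class.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

# Rows are true labels, columns are predicted labels.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

print(cm)
print("TN", tn, "FP", fp, "FN", fn, "TP", tp)
```

Unpacking the four counts by name avoids misreading which off-diagonal entry is the false positives.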

In [None]:
#build a linear svc model (C=100); the default kernel is rbf, so set it explicitly
svc_linear = SVC(kernel='linear', C=100)
#fit to data
svc_linear.fit(train_vectors, train_labels)

#use test features to predict outcomes
pred = svc_linear.predict(test_vectors)

#eval fit w/ confusion matrix and classification report
print('Confusion Matrix: \n', confusion_matrix(test_labels, pred))
#and classification report
print(classification_report(test_labels,pred))

**Question:** How accurate is your model? What evidence are you using to determine that? How many false positives and false negatives does it predict?

<font size=8 color="#009600">&#9998;</font> I'd say it is quite accurate: the precision is high, the true positive and true negative counts in the confusion matrix are high, and the false positive and false negative counts are low.

---
## 4. Finding and using the best hyperparameters

We have fit one model and determined its performance, but is it the best model? We can use `GridSearchCV` to find the best model (given our choices of parameters). Once we do that, we will use that best model going forward. **Note:** you would typically rerun this grid search in a production environment to continue to verify the best model, but we are not for the sake of speed.

### 4.1 Grid search

<font size=8 color="#009600">&#9998;</font> Do this - Using the following parameters (`C` = 1, 10, 100, 1000 and `gamma` = 1e-4, 1e-3, 0.01, 0.1) for both a `linear` and `rbf` kernel use `GridSearchCV` with the `SVC()` model to find the best fit parameters. Print the "best estimators".

In [None]:
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [ 1e-4, 1e-3, 0.01,0.1],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'gamma': [ 1e-4, 1e-3, 0.01,0.1], 'C': [1, 10, 100, 1000]}]
scores = ['precision', 'recall']

for score in scores:
    print("# Tuning hyper-parameters for %s" % score)
    print()

    clf = GridSearchCV(
        SVC(), tuned_parameters, scoring='%s_macro' % score
    )
    clf.fit(train_vectors, train_labels)
    print("Best parameters set found on development set:")

    print(clf.best_params_)
    means = clf.cv_results_['mean_test_score']
    stds = clf.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, clf.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r"
              % (mean, std * 2, params))

Results from above (takes a while to run)


Best parameters set found on development set:
{'C': 100, 'gamma': 0.01, 'kernel': 'rbf'}
0.819 (+/-0.018) for {'C': 1, 'gamma': 0.0001, 'kernel': 'rbf'}
0.905 (+/-0.016) for {'C': 1, 'gamma': 0.001, 'kernel': 'rbf'}
0.951 (+/-0.016) for {'C': 1, 'gamma': 0.01, 'kernel': 'rbf'}
0.970 (+/-0.017) for {'C': 1, 'gamma': 0.1, 'kernel': 'rbf'}
0.904 (+/-0.016) for {'C': 10, 'gamma': 0.0001, 'kernel': 'rbf'}
0.942 (+/-0.011) for {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}
0.987 (+/-0.008) for {'C': 10, 'gamma': 0.01, 'kernel': 'rbf'}
0.972 (+/-0.010) for {'C': 10, 'gamma': 0.1, 'kernel': 'rbf'}
0.941 (+/-0.011) for {'C': 100, 'gamma': 0.0001, 'kernel': 'rbf'}
0.969 (+/-0.011) for {'C': 100, 'gamma': 0.001, 'kernel': 'rbf'}
0.989 (+/-0.007) for {'C': 100, 'gamma': 0.01, 'kernel': 'rbf'}
0.972 (+/-0.010) for {'C': 100, 'gamma': 0.1, 'kernel': 'rbf'}
0.960 (+/-0.011) for {'C': 1000, 'gamma': 0.0001, 'kernel': 'rbf'}
0.988 (+/-0.003) for {'C': 1000, 'gamma': 0.001, 'kernel': 'rbf'}
0.988 (+/-0.010) for {'C': 1000, 'gamma': 0.01, 'kernel': 'rbf'}
0.972 (+/-0.010) for {'C': 1000, 'gamma': 0.1, 'kernel': 'rbf'}
0.961 (+/-0.014) for {'C': 1, 'gamma': 0.0001, 'kernel': 'linear'}
0.961 (+/-0.014) for {'C': 1, 'gamma': 0.001, 'kernel': 'linear'}
0.961 (+/-0.014) for {'C': 1, 'gamma': 0.01, 'kernel': 'linear'}
0.961 (+/-0.014) for {'C': 1, 'gamma': 0.1, 'kernel': 'linear'}
0.963 (+/-0.012) for {'C': 10, 'gamma': 0.0001, 'kernel': 'linear'}
0.963 (+/-0.012) for {'C': 10, 'gamma': 0.001, 'kernel': 'linear'}
0.963 (+/-0.012) for {'C': 10, 'gamma': 0.01, 'kernel': 'linear'}
0.963 (+/-0.012) for {'C': 10, 'gamma': 0.1, 'kernel': 'linear'}
0.964 (+/-0.009) for {'C': 100, 'gamma': 0.0001, 'kernel': 'linear'}
0.964 (+/-0.009) for {'C': 100, 'gamma': 0.001, 'kernel': 'linear'}
0.964 (+/-0.009) for {'C': 100, 'gamma': 0.01, 'kernel': 'linear'}
0.964 (+/-0.009) for {'C': 100, 'gamma': 0.1, 'kernel': 'linear'}
0.963 (+/-0.010) for {'C': 1000, 'gamma': 0.0001, 'kernel': 'linear'}
0.963 (+/-0.010) for {'C': 1000, 'gamma': 0.001, 'kernel': 'linear'}
0.963 (+/-0.010) for {'C': 1000, 'gamma': 0.01, 'kernel': 'linear'}
0.963 (+/-0.010) for {'C': 1000, 'gamma': 0.1, 'kernel': 'linear'}
# Tuning hyper-parameters for recall

Best parameters set found on development set:
{'C': 100, 'gamma': 0.01, 'kernel': 'rbf'}
0.741 (+/-0.040) for {'C': 1, 'gamma': 0.0001, 'kernel': 'rbf'}
0.905 (+/-0.015) for {'C': 1, 'gamma': 0.001, 'kernel': 'rbf'}
0.949 (+/-0.016) for {'C': 1, 'gamma': 0.01, 'kernel': 'rbf'}
0.969 (+/-0.017) for {'C': 1, 'gamma': 0.1, 'kernel': 'rbf'}
0.905 (+/-0.016) for {'C': 10, 'gamma': 0.0001, 'kernel': 'rbf'}
0.939 (+/-0.012) for {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}
0.987 (+/-0.008) for {'C': 10, 'gamma': 0.01, 'kernel': 'rbf'}
0.973 (+/-0.011) for {'C': 10, 'gamma': 0.1, 'kernel': 'rbf'}
0.938 (+/-0.012) for {'C': 100, 'gamma': 0.0001, 'kernel': 'rbf'}
0.970 (+/-0.011) for {'C': 100, 'gamma': 0.001, 'kernel': 'rbf'}
0.989 (+/-0.007) for {'C': 100, 'gamma': 0.01, 'kernel': 'rbf'}
0.973 (+/-0.011) for {'C': 100, 'gamma': 0.1, 'kernel': 'rbf'}
0.959 (+/-0.013) for {'C': 1000, 'gamma': 0.0001, 'kernel': 'rbf'}
0.988 (+/-0.003) for {'C': 1000, 'gamma': 0.001, 'kernel': 'rbf'}
0.988 (+/-0.009) for {'C': 1000, 'gamma': 0.01, 'kernel': 'rbf'}
0.973 (+/-0.011) for {'C': 1000, 'gamma': 0.1, 'kernel': 'rbf'}
0.961 (+/-0.014) for {'C': 1, 'gamma': 0.0001, 'kernel': 'linear'}
0.961 (+/-0.014) for {'C': 1, 'gamma': 0.001, 'kernel': 'linear'}
0.961 (+/-0.014) for {'C': 1, 'gamma': 0.01, 'kernel': 'linear'}
0.961 (+/-0.014) for {'C': 1, 'gamma': 0.1, 'kernel': 'linear'}
0.962 (+/-0.012) for {'C': 10, 'gamma': 0.0001, 'kernel': 'linear'}
0.962 (+/-0.012) for {'C': 10, 'gamma': 0.001, 'kernel': 'linear'}
0.962 (+/-0.012) for {'C': 10, 'gamma': 0.01, 'kernel': 'linear'}
0.962 (+/-0.012) for {'C': 10, 'gamma': 0.1, 'kernel': 'linear'}
0.964 (+/-0.010) for {'C': 100, 'gamma': 0.0001, 'kernel': 'linear'}
0.964 (+/-0.010) for {'C': 100, 'gamma': 0.001, 'kernel': 'linear'}
0.964 (+/-0.010) for {'C': 100, 'gamma': 0.01, 'kernel': 'linear'}
0.964 (+/-0.010) for {'C': 100, 'gamma': 0.1, 'kernel': 'linear'}
0.962 (+/-0.010) for {'C': 1000, 'gamma': 0.0001, 'kernel': 'linear'}
0.962 (+/-0.010) for {'C': 1000, 'gamma': 0.001, 'kernel': 'linear'}
0.962 (+/-0.010) for {'C': 1000, 'gamma': 0.01, 'kernel': 'linear'}
0.962 (+/-0.010) for {'C': 1000, 'gamma': 0.1, 'kernel': 'linear'}

### 4.2 Evaluating the best fit model

Now that we have found the "best estimators", let's determine how good the fit is.

<font size=8 color="#009600">&#9998;</font> Do this - Use the test features to predict the outcomes for the best model. Evaluate the fit using the confusion matrix and classification report. 

**Note:** You should look at the documentation on the confusion matrix because the way `sklearn` outputs false positives and false negatives is different from what most images on the web indicate.

In [None]:
#0.989 (+/-0.007) for {'C': 100, 'gamma': 0.01, 'kernel': 'rbf'}
#build svc model with the best parameters from the grid search
svc_linear2 = SVC(C = 100, gamma=0.01, kernel='rbf')

#fit to data
svc_linear2.fit(train_vectors, train_labels)

#use test features to predict outcomes
pred2 = svc_linear2.predict(test_vectors)

#eval fit w/ confusion matrix and classification report
print('Confusion Matrix: \n', confusion_matrix(test_labels, pred2))
#and classification report
print(classification_report(test_labels,pred2))
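
As an alternative to retyping the best parameters by hand, a fitted `GridSearchCV` object exposes the winning model as `best_estimator_`. A minimal sketch, using synthetic data from `make_classification` and a deliberately tiny grid (both are assumptions to keep the run short, not the grid from 4.1):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic data standing in for the chess features.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A deliberately tiny grid so the sketch runs quickly.
param_grid = {"kernel": ["rbf"], "C": [1, 10], "gamma": [0.01, 0.1]}
search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(X_train, y_train)

# With the default refit=True, best_estimator_ is already refit on all of X_train.
best_model = search.best_estimator_
test_pred = best_model.predict(X_test)

print(search.best_params_)
print(len(test_pred))
```

Reusing `best_estimator_` keeps the evaluated model in sync with the search results if the grid is ever changed.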

**Question:** How accurate is this best model? What evidence are you using to determine that? How many false positives and false negatives does it predict?

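<font size=8 color="#009600">&#9998;</font> The model is quite accurate: in the grid search these parameters reached a cross-validated macro precision and recall of about 0.989, and on the test set it produced the same confusion matrix and classification report as the initial model.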

---
## 5. Using Principal Components

The full model uses 36 features to predict the results, and you likely found that the model is incredibly accurate. But in some cases, we might have even more features (which means much more computational time), and we might not need nearly the level of accuracy we can achieve with the full data set. So, we will see how close we can get with fewer features. But instead of simply removing features, we will use a PCA to determine the features that contribute the most to the model (through their accounted variance) and use those to build our SVC model.
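
What "accounted variance" means can be sketched by hand with `numpy`: the eigenvalues of the covariance matrix of the centered data are the variances along each principal component, and dividing by their sum gives the explained variance ratio. The data below are synthetic, chosen so the three columns have very different spreads.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 3 features with very different spreads (not the chess data).
X = rng.normal(size=(500, 3)) * np.array([5.0, 2.0, 0.5])

# PCA works on centered data; the eigenvalues of the covariance matrix are
# the variances along each principal component.
X_centered = X - X.mean(axis=0)
eigenvalues = np.linalg.eigvalsh(np.cov(X_centered, rowvar=False))[::-1]  # descending

explained_variance_ratio = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained_variance_ratio)

print(explained_variance_ratio)
print(cumulative)
```

The cumulative sum is the "total explained variance" for keeping the first 1, 2, or 3 components, and it reaches 1 only when every component is kept.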

### 5.1 Building a PCA

We will start with a small number of components (say, 4) to see how well we can predict the outcomes of the games.

<font size=8 color="#009600">&#9998;</font> Do this - Using `PCA()`, fit a pca to your training features with 4 components. Transform both the test and training features using this pca. Plot the `explained_variance_` versus component number.

In [None]:
#use PCA: fit a pca to the training features with 4 components
#Set up the pca object with the number of components we want to find
pca = PCA(n_components=4, whiten=True)

#Fit the training data to the pca model.
_ = pca.fit(train_vectors)
#transform both test and training features using this pca
pca_train_vectors = pca.transform(train_vectors)
pca_test_vectors = pca.transform(test_vectors)

print("Training set changed from a size of: ", train_vectors.shape, ' to: ', pca_train_vectors.shape)
print("Testing set changed from a size of: ", test_vectors.shape, ' to: ', pca_test_vectors.shape)
#plot the explained variance ratio vs component number

plt.plot(pca.explained_variance_ratio_, marker="o")
print(pca.explained_variance_ratio_)
total_variance = np.sum(pca.explained_variance_ratio_)*100
print('total variance is',total_variance)

**Question:** What is the total explained variance captured by this PCA (we will use this later, just quote the number)? How well do you think a model with this many features will perform? Why?

<font size=8 color="#009600">&#9998;</font> The total explained variance is about 44%. I do not think this model will perform very well: four components are too few and leave more than half of the variance unexplained.

### 5.2 Fit and Evaluate an SVC model

Using the pca transformed features, we will train and test an SVC model using the "best estimators".

<font size=8 color="#009600">&#9998;</font> Do this - Using the PCA-transformed training data, build and train an SVC model. Predict the classes using the PCA-transformed test data. Evaluate the model using the classification report and the confusion matrix.

In [None]:
#using the pca transformed training data, build and train an svc model
svc_linear3 = SVC(C = 100, gamma=0.01, kernel='rbf')

#fit to the pca transformed training data (not the test data)
svc_linear3.fit(pca_train_vectors, train_labels)

#predict the classes using the pca transformed test data
pred3 = svc_linear3.predict(pca_test_vectors)

#eval the model using the classification report and confusion matrix
print('Confusion Matrix: \n', confusion_matrix(test_labels, pred3))
print(classification_report(test_labels,pred3))
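
One way to make it hard to mix up training and test data is to chain the PCA and the SVC in a single `sklearn` pipeline, so that both steps are fit on the training set only and then applied to the test set. This is a sketch of an optional pattern, not something the assignment asks for, and it uses synthetic data from `make_classification`.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Synthetic data standing in for the chess features.
X, y = make_classification(n_samples=300, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# fit() learns the PCA and the SVC from the training data only;
# predict() applies the already-fitted PCA to the test data before classifying.
model = make_pipeline(PCA(n_components=4), SVC(C=100, gamma=0.01, kernel="rbf"))
model.fit(X_train, y_train)

test_accuracy = accuracy_score(y_test, model.predict(X_test))
print(test_accuracy)
```

Because the test rows never reach `fit`, the reported accuracy is an estimate of how the model does on games it has not seen.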

**Question:** How accurate is this model? What evidence are you using to determine that? How many false positives and false negatives does it predict? How does it compare to the full model?

<font size=8 color="#009600">&#9998;</font> Looking at the precision scores, which are about 0.64, and the large increase in false positives and false negatives compared to the full model, I would say this model is not very accurate.

### 5.3 Repeat your analysis with more components

You probably found that the model with 4 features didn't work so well. What if we increase the number of components (say to 30, which is still 6 fewer than the full data set)? What happens now?

<font size=8 color="#009600">&#9998;</font> Do this - Repeat your analysis from 5.1 and 5.2 using 30 components instead.

In [None]:
#repeat the analysis from 5.1 and 5.2 with 30 components

#Set up the pca object with the number of components we want to find
pca2 = PCA(n_components=30, whiten=True)

#Fit the training data to the pca model.
_ = pca2.fit(train_vectors)
#transform both test and training features using this pca
pca_train_vectors2 = pca2.transform(train_vectors)
pca_test_vectors2 = pca2.transform(test_vectors)

print("Training set changed from a size of: ", train_vectors.shape, ' to: ', pca_train_vectors2.shape)
print("Testing set changed from a size of: ", test_vectors.shape, ' to: ', pca_test_vectors2.shape)
#plot the explained variance ratio vs component number

plt.plot(pca2.explained_variance_ratio_, marker="o")
print(pca2.explained_variance_ratio_)
total_variance2 = np.sum(pca2.explained_variance_ratio_)*100
print('total variance is',total_variance2)
#using the pca transformed training data, build and train an svc model
svc_linear4 = SVC(C = 100, gamma=0.01, kernel='rbf')

#fit to the pca transformed training data (not the test data)
svc_linear4.fit(pca_train_vectors2, train_labels)

#predict the classes using the pca transformed test data
pred4 = svc_linear4.predict(pca_test_vectors2)

#eval the model using the classification report and confusion matrix
print('Confusion Matrix: \n', confusion_matrix(test_labels, pred4))
print(classification_report(test_labels,pred4))

**Question:** What is the total explained variance captured by this PCA? How accurate is this model? What evidence are you using to determine that? How many false positives and false negatives does it predict? How does it compare to the 4 component model? To the full model?

<font size=8 color="#009600">&#9998;</font> The total explained variance is about 99.33%. This model is much more accurate than the 4-component model and pretty close to the full model. I am basing this on the true and false positive/negative counts in the confusion matrix.

---
## 6. How well does a PCA work?

Clearly, the number of components we use in our PCA matters. Let's investigate how they matter by systematically building a model for any number of selected components.

### 6.1 Accuracy vs. Components

We will do this by writing a function that creates the PCA, creates the SVC model, fits the training data, predicts the labels using the test data, and returns the accuracy score and the explained variance. So your function will take as input:

* the number of components
* the training features
* the test features
* the training labels
* the test labels

and it will return the accuracy score for an SVC model fit to PCA-transformed features and the total explained variance.

<font size=8 color="#009600">&#9998;</font> Do this - Create this function, which you will use in the next section.

In [None]:
#fcn takes as input: # components, training features, test features, training labels, test labels
#returns the accuracy score of an svc model fit to the pca transformed features,
#the total explained variance (in percent), and the number of components


def wizard_chess(n,train_vectors, test_vectors, train_labels, test_labels):
    #Set up the pca object with the number of components we want to find
    pca = PCA(n_components=n, whiten=True)

    #Fit the pca model to the training data.
    _ = pca.fit(train_vectors)
    #transform both test and training features using this pca
    pca_train_vectors = pca.transform(train_vectors)
    pca_test_vectors = pca.transform(test_vectors)

    total_variance = np.sum(pca.explained_variance_ratio_)*100

    #using the pca transformed training data, build and train an svc model
    svc = SVC(C = 100, gamma=0.01, kernel='rbf')

    #fit to the pca transformed training data (not the test data)
    svc.fit(pca_train_vectors, train_labels)

    #predict the classes using the pca transformed test data
    pred = svc.predict(pca_test_vectors)
    acc_score=accuracy_score(test_labels,pred)
    return acc_score, total_variance, n

### 6.2 Compute accuracies

Now that you have created a function that returns the accuracy for a given number of components, we will use that to plot how the accuracy of your SVC model changes when we increase the number of components used in the PCA.

<font size=8 color="#009600">&#9998;</font> Do this - For 1 to 36 components, use your function above to compute and store (as a list) the accuracy of your models.

In [None]:
model_acc=[]
total_var=[]
n_value=[]
for i in range(1,37):
    a,b,c = wizard_chess(i,train_vectors, test_vectors, train_labels, test_labels)
    model_acc.append(a)
    total_var.append(b)
    n_value.append(c)

    
print(model_acc)
print(total_var)
print(n_value)

### 6.3 Plot accuracy vs number of components

Now that we have those numbers, it makes sense to look at the accuracy vs components.

<font size=8 color="#009600">&#9998;</font> Do this - Plot the accuracy vs components.

In [None]:
#accuracy vs components
plt.plot(n_value, model_acc)
plt.xlabel('# components')
plt.ylabel('accuracy')

**Question:** Where does it seem like we have diminishing returns, that is, no major increase in accuracy as we add additional components to the PCA?

<font size=8 color="#009600">&#9998;</font> At about 16 or 17 components.
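
Reading the point of diminishing returns off a plot is a judgment call. One possible way to make it repeatable is a small helper that returns the first number of components whose accuracy is within a chosen tolerance of the best accuracy. The helper name, the tolerance, and the accuracies below are all hypothetical.

```python
def first_within_tolerance(scores, tolerance=0.01):
    """Return the 1-based position of the first score within `tolerance` of the best."""
    best = max(scores)
    for position, score in enumerate(scores, start=1):
        if best - score <= tolerance:
            return position

# Made-up accuracies for 1..8 components, for illustration only.
demo_acc = [0.60, 0.68, 0.75, 0.88, 0.93, 0.955, 0.96, 0.96]

print(first_within_tolerance(demo_acc))
print(first_within_tolerance(demo_acc, tolerance=0.05))
```

Applied to the `model_acc` list from 6.2, the returned position would be the number of components, since that list starts at one component.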

### 6.4 Plot total explained variance vs number of components

<font size=8 color="#009600">&#9998;</font> Do this - Plot the total explained variance vs components. 

In [None]:
#total explained variance vs components
plt.plot(n_value, total_var)
plt.xlabel('# components')
plt.ylabel('total explained variance (%)')

**Question:** Where does it seem like we have diminishing returns, that is, no major increase in explained variance as we add additional components to the PCA? How does that number of components compare to the diminishing returns for accuracy?

<font size=8 color="#009600">&#9998;</font> At about 25 components, which is a few more than where the accuracy plot levels off.
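
Instead of scanning over the number of components, `PCA` can also be given a target fraction of variance: a float `n_components` between 0 and 1 (with the full SVD solver) keeps just enough components to exceed that fraction. A sketch on synthetic data, with the 0.95 target chosen arbitrarily for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data whose columns have decreasing spread (not the chess data).
X = rng.normal(size=(400, 10)) * np.linspace(5.0, 0.5, 10)

# A float n_components asks PCA to keep enough components to explain
# more than that fraction of the total variance.
pca_95 = PCA(n_components=0.95, svd_solver="full").fit(X)

print(pca_95.n_components_)
print(pca_95.explained_variance_ratio_.sum())
```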

---
## 7. Assignment wrap-up
Please fill out the form that appears when you run the code below. **You must completely fill this out in order to receive credit for the assignment!**

In [None]:
from IPython.display import HTML
HTML(
"""
<iframe 
	src="https://docs.google.com/forms/d/e/1FAIpQLSc0IBD2mdn4TcRyi-KNXVtS3aEg6U4mOFq2MOciLQyEP4bg1w/viewform?usp=sf_link" 
	width="800px" 
	height="600px" 
	frameborder="0" 
	marginheight="0" 
	marginwidth="0">
	Loading...
</iframe>
"""
)

### Congratulations, you're done!
Submit this assignment by uploading it to the course Desire2Learn web page. Go to the "Homework Assignments" folder, find the dropbox link for Homework 4, and upload your notebook.