# <center><strong>Important:</strong> Make a Copy of this Google Colab Notebook!
</center>

<p>Please refrain from using or modifying this current Google Colab notebook directly. Instead, follow these instructions to create your own copy:</p>

<ol>
  <li>Go to the "File" menu at the top of the Colab interface.</li>
  <li>Select "Save a copy in Drive" to create a duplicate of this notebook.</li>
  <li>You can now work on your own copy without affecting the original.</li>
</ol>

<p>This ensures that you have a personalized version to work on and make changes according to your needs. Remember to save your progress as you go. Enjoy working on your own copy of the Google Colab notebook!</p>

# **Module 24 Expanding SVM to Additional Datasets**
In this module, you will apply scikit-learn's built-in SVM functions to a binary classification task. The goal is for you to use these functions on a dataset of your choosing from the scikit-learn datasets. You can use Module 22 for guidance.

## **Getting Started**

Run the provided code sections and follow the instructions.

## **Importing Python Packages**
The first step is to import your necessary Python packages.

In [2]:
import numpy as np
import matplotlib.pyplot as plt

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn import metrics
from sklearn.inspection import DecisionBoundaryDisplay

## **Choosing a New Dataset**



There are many datasets that can be used for binary classification. In this example, we will continue to use datasets that are readily available through scikit-learn. Let's start by exploring what scikit-learn offers. To do this, we'll list the contents of the `datasets` module using Python's built-in `dir` function, which returns the names of the attributes of an object. In this case, the object is `datasets`, the module we imported from sklearn above.
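Because `dir` also returns private names and helper functions, a filtered view can be easier to scan. The snippet below is a minimal sketch that keeps only names starting with `load_`; the variable name `loaders` is an illustrative choice, and a few of the resulting names (such as `load_files`) are file-loading helpers rather than bundled datasets.

```python
from sklearn import datasets

# Keep only the names that start with "load_". Most of these load the small
# datasets bundled with scikit-learn; a few are file-loading helpers.
loaders = [name for name in dir(datasets) if name.startswith("load_")]

for name in loaders:
    print(name)
```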

In [3]:
dir(datasets)

['__all__',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__getattr__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '_arff_parser',
 '_base',
 '_california_housing',
 '_covtype',
 '_kddcup99',
 '_lfw',
 '_olivetti_faces',
 '_openml',
 '_rcv1',
 '_samples_generator',
 '_species_distributions',
 '_svmlight_format_fast',
 '_svmlight_format_io',
 '_twenty_newsgroups',
 'clear_data_home',
 'dump_svmlight_file',
 'fetch_20newsgroups',
 'fetch_20newsgroups_vectorized',
 'fetch_california_housing',
 'fetch_covtype',
 'fetch_kddcup99',
 'fetch_lfw_pairs',
 'fetch_lfw_people',
 'fetch_olivetti_faces',
 'fetch_openml',
 'fetch_rcv1',
 'fetch_species_distributions',
 'get_data_home',
 'load_breast_cancer',
 'load_diabetes',
 'load_digits',
 'load_files',
 'load_iris',
 'load_linnerud',
 'load_sample_image',
 'load_sample_images',
 'load_svmlight_file',
 'load_svmlight_files',
 'load_wine',
 'make_biclusters',
 'make_blobs',
 'make_checkerboard',
 'make_circl

As we can see, scikit-learn provides many datasets. However, only some of them are useful for simple binary classification tasks. To determine whether a dataset is suitable for binary classification, you can learn more about its contents by outputting its description. We can do this by printing the `DESCR` attribute of a loaded dataset. In the example below, we output the description for the `load_breast_cancer` dataset that we used in the previous modules.

In [8]:
print(datasets.load_breast_cancer().DESCR)

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        worst/largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 0 is Mean Radi

For the task of binary classification, we recommend the following datasets which can be modified to work for a binary classification task:

*   `load_wine`
*   `load_iris`
*   `load_digits`
*   `load_diabetes`

Several of these datasets contain more than two classes, but you can keep the data from only two of them to obtain a dataset for binary classification. To do this, filter the dataset to remove any rows that belong to the classes you are not considering in your SVM model. We recommend using the `np.where` function to get the index values of the rows you wish to keep: `indices_to_keep = np.where(y != classToRemove)[0]`, where `classToRemove` is the class you wish to filter out of the dataset. You can then filter both your feature matrix and your target vector using `X_filtered = X[indices_to_keep]`, where `X` is the matrix or vector you are filtering. For a dataset with more than three classes, such as `load_digits`, you will need to remove every class other than the two you keep. Note that `load_diabetes` has a continuous target rather than class labels, so its target would first need to be converted into two classes, for example by thresholding.
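The filtering step described above can be sketched as follows. This is only an illustration: it assumes the `load_iris` dataset and arbitrarily removes class `2`; the names `X_filtered` and `y_filtered` follow the naming used in the text.

```python
import numpy as np
from sklearn import datasets

# Load the iris dataset: 150 rows, three classes labeled 0, 1, and 2.
iris = datasets.load_iris()
X, y = iris.data, iris.target

# Illustrative choice: drop class 2 so that only classes 0 and 1 remain.
classToRemove = 2
indices_to_keep = np.where(y != classToRemove)[0]

# Apply the same indices to both the feature matrix and the target vector
# so that rows and labels stay aligned.
X_filtered = X[indices_to_keep]
y_filtered = y[indices_to_keep]

print(X_filtered.shape)        # (100, 4)
print(np.unique(y_filtered))   # [0 1]
```

When more than one class has to be removed, one option is to build the index list from the classes you want to keep instead, for example with `np.where(np.isin(y, [0, 1]))[0]`.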

Load and (if necessary) filter your dataset below:


In [19]:
# Load dataset

# Make adjustment for binary classification if necessary:


## **Implementing SVM Using a New Dataset**

### **Splitting the Dataset**

Next, we need to split the dataset into training and testing datasets. This is necessary because we need to hold out a testing dataset to see how the model performs on unseen observations. To split the dataset, we'll use the function `train_test_split()` from `sklearn.model_selection`. The function takes in the following:
*   `X`: The feature data, typically represented as a NumPy array or pandas DataFrame.
*   `y`: The target variable or label data, corresponding to the feature data.
*   `test_size`: The proportion (between 0.0 and 1.0) of the dataset to include in the test split. For example, `test_size=0.3` would create a test set comprising 30% of the total data, while the remaining 70% is allocated to the training set.
*   `random_state`: The seed value used for random shuffling and splitting of the dataset. Setting a specific `random_state` ensures reproducibility of the split. If `random_state` is not provided, the data will be split differently each time the function is called.

The `train_test_split()` function returns four subsets, in this order:
*   `X_train`: The training set of feature data.
*   `X_test`: The test set of feature data.
*   `y_train`: The corresponding target variable for the training set.
*   `y_test`: The corresponding target variable for the test set.
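As a minimal sketch of the call described above, the example below splits the breast cancer dataset from the previous modules. The values `test_size=0.3` and `random_state=42` are illustrative assumptions, not required settings.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

# return_X_y=True returns the feature matrix and target vector directly.
X, y = datasets.load_breast_cancer(return_X_y=True)

# Hold out 30% of the rows for testing; fixing random_state makes the
# split reproducible from run to run.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print("Training set:", X_train.shape, y_train.shape)
print("Testing set: ", X_test.shape, y_test.shape)
```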

In [16]:
# Implement code for using train_test_split here:


### **Training the Model**

Now that we have our training and testing datasets, we can run the SVM algorithm using the built-in functions from scikit-learn. The model we are using is the `svm.SVC` class, where SVC stands for support vector classification. The inputs that we will focus on are `kernel` and `C`.

*   `kernel`: Specifies the kernel type to be used in the SVM algorithm. The built-in choices are 'linear', 'poly', 'rbf' (Radial Basis Function), and 'sigmoid'; a precomputed kernel matrix or a callable can also be supplied. The default is 'rbf'.
*   `C`: Penalty parameter C of the error term. It controls the trade-off between maximizing the margin and minimizing the classification errors. Higher values of C prioritize correct classification of training examples, potentially leading to overfitting. The default value is 1.0.

We are going to focus on using a linear kernel since we want to linearly separate the data. However, we can adjust the value of the regularization parameter, `C`. Note that in scikit-learn the strength of the regularization is inversely proportional to `C`.

We will also be using the method `svm.SVC.fit`, which trains the SVM model on the given training data `X_train` and corresponding labels `y_train`. Since we assign the `svm.SVC` object to `model`, we can call this method as `model.fit`.
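A minimal sketch of creating and fitting the model is shown below. It assumes the iris dataset reduced to classes 0 and 1, and uses `C=1.0` and `random_state=0` purely as illustrative values; your own dataset and settings will differ.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

# Build a two-class dataset by dropping class 2 from iris (illustrative choice).
X, y = datasets.load_iris(return_X_y=True)
keep = np.where(y != 2)[0]
X, y = X[keep], y[keep]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Create the classifier with a linear kernel, then train it on the
# training split only.
model = svm.SVC(kernel="linear", C=1.0)
model.fit(X_train, y_train)

# After fitting, the support vectors found by the algorithm are available.
print("Number of support vectors:", model.support_vectors_.shape[0])
```

Rerunning the cell with a different `C` is a simple way to see how the number of support vectors responds to the setting.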


In [17]:
# Write code to use svm.SVC and svm.SVC.fit here:


### **Testing the Model**

Now we can use the SVM model to predict the labels of the testing dataset. This is done using the `svm.SVC.predict` method, which we call as `model.predict`. The input to this method is the testing feature data, and the output is the labels predicted by the SVM model.
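The prediction step can be sketched as follows, again assuming a two-class version of the iris dataset and illustrative settings. The variable name `y_pred` is an assumption introduced here for the predicted labels.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

# Two-class iris data (classes 0 and 1), as an illustrative dataset.
X, y = datasets.load_iris(return_X_y=True)
keep = np.where(y != 2)[0]
X, y = X[keep], y[keep]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = svm.SVC(kernel="linear", C=1.0)
model.fit(X_train, y_train)

# predict takes only the features; the true test labels are kept aside
# for comparison afterwards.
y_pred = model.predict(X_test)

print("Predicted labels:", y_pred)
print("True labels:     ", y_test)
```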

Additionally, we can compute three metrics to describe the results of the prediction. In this case, we are using the accuracy, precision, and recall.

*   **Precision**: Precision is a measure of the model's ability to correctly identify positive instances (true positives) out of all instances predicted as positive. It is calculated as the ratio of true positives (TP) to the sum of true positives and false positives (FP).
*   **Recall (Sensitivity or True Positive Rate)**: Recall measures the ability of the model to correctly identify positive instances (true positives) out of all actual positive instances. It is calculated as the ratio of true positives (TP) to the sum of true positives and false negatives (FN).
*   **Accuracy**: Accuracy measures the overall correctness of the model's predictions. It calculates the ratio of the number of correct predictions (true positives and true negatives) to the total number of instances. Accuracy provides an overall measure of the model's performance, considering both positive and negative predictions. However, accuracy alone may not be sufficient if the dataset is imbalanced (i.e., when the number of instances in one class is much higher than the other), as the model may achieve high accuracy by simply predicting the majority class.
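To connect these definitions to the scikit-learn functions, the sketch below uses a small set of made-up labels (not results from any model) and checks each metric against the counts of true and false positives and negatives. The label values are invented for illustration only.

```python
from sklearn import metrics

# Made-up labels for illustration: 1 is the positive class, 0 the negative.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred).ravel()
print("TP:", tp, "FP:", fp, "FN:", fn, "TN:", tn)

accuracy = metrics.accuracy_score(y_true, y_pred)    # (TP + TN) / total
precision = metrics.precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = metrics.recall_score(y_true, y_pred)        # TP / (TP + FN)

print("Accuracy: ", accuracy)
print("Precision:", precision)
print("Recall:   ", recall)
```

With these labels there are 2 true positives, 1 false positive, 2 false negatives, and 3 true negatives, so accuracy is 5/8, precision is 2/3, and recall is 2/4.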





In [18]:
# Write code to use svm.SVC.predict here:


Accuracy on test dataset: 0.9230769230769231
Precision on test dataset: 1.0
Recall on test dataset: 0.8421052631578947


In [None]:
# Write code to output the accuracy, precision, and recall here using metrics.accuracy_score, metrics.precision_score, and metrics.recall_score:

Accuracy =
Precision =
Recall =

print("Accuracy on test dataset:",Accuracy)
print("Precision on test dataset:",Precision)
print("Recall on test dataset:",Recall)

## ✨ **Congratulations, you have now coded the SVM algorithm using scikit-learn!** ✨