## 21.2 Classification


### Class Objectives

* Will understand how to calculate and apply the fundamental classification algorithms: logistic regression, SVM, KNN, decision trees, and random forests.

* Will understand how to quantify and validate classification models including calculating a classification report.

* Will understand how to apply `GridSearchCV` to tune model hyperparameters.


# Instructor Turn Activity 1 Logistic Regression

Logistic Regression is a statistical method for predicting binary outcomes from data.

Examples of this are "yes" vs "no" or "young" vs "old". 

These categories translate to the probability of being a 0 or a 1.

We can calculate logistic regression by adding an activation function as the final step to our linear model. 

This converts the linear regression output to a probability.


  * Logistic Regression is a statistical method for predicting binary outcomes from data. With linear regression, our linear model may provide a numerical output such as age. With logistic regression, the numerical value for age could be translated to a probability between 0 and 1. This probability could then be converted to a discrete label such as "young" or "old".

    ![logistic-regression.png](Images/logistic-regression.png)

  * Logistic regression is calculated by applying an activation function as the final step to our linear model. This transforms a numerical range to a bounded probability between 0 and 1.

    ![logistic-regression-activation-function.png](Images/logistic-regression-activation-function.png)

  * We can use logistic regression to predict which category or class a new data point should have.

    ![logistic_1.png](Images/logistic_1.png)
    ![logistic_2.png](Images/logistic_2.png)
    ![logistic_3.png](Images/logistic_3.png)
    ![logistic_4.png](Images/logistic_4.png)
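
The activation step can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the standard logistic (sigmoid) function as the activation; the weight `w`, intercept `b`, and sample ages are made-up values chosen for illustration, not fitted parameters.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear model: z = w * age + b (values chosen for illustration)
w, b = 0.2, -8.0
ages = np.array([20, 45, 60])

linear_output = w * ages + b            # unbounded numbers
probabilities = sigmoid(linear_output)  # bounded between 0 and 1

# Threshold the probability at 0.5 to obtain a discrete label
labels = np.where(probabilities >= 0.5, "old", "young")
print(probabilities.round(3), labels)
```

The threshold of 0.5 is a common default, not a requirement; a different cutoff trades one kind of misclassification for another.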
 

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd

Generate some data

* We use the `make_blobs` function to generate two different groups (classes) of data. We can then apply logistic regression to determine if new data points belong to the purple group or the yellow group.

    ![make-blobs.png](Images/make-blobs.png)

  * We create our model using the `LogisticRegression` class from Sklearn.

    ![logistic-regression-model.png](Images/logistic-regression-model.png)

  * Then we fit (train) the model using our training data.

    ![train-logistic-model.png](Images/train-logistic-model.png)

  * And validate it using the test data.

    ![test-logistic-model.png](Images/test-logistic-model.png)

  * And finally, we can make predictions.

    ![new-data.png](Images/new-data.png)

    ![predicted-class.png](Images/predicted-class.png)

In [None]:
# make_blobs Generate isotropic Gaussian blobs for clustering.

from sklearn.datasets import make_blobs

X, y = make_blobs(centers=2, random_state=42)

print(f"Labels: {y[:10]}")
print(f"Data: {X[:10]}")

In [None]:
# Visualizing both classes
plt.scatter(X[:, 0], X[:, 1], c=y)

Split our data into training and testing

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, stratify=y)

Create a Logistic Regression Model

In [None]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier

Fit (train) our model using the training data

In [None]:
classifier.fit(X_train, y_train)

Validate the model using the test data

In [None]:
print(f"Training Data Score: {classifier.score(X_train, y_train)}")
print(f"Testing Data Score: {classifier.score(X_test, y_test)}")
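
For scikit-learn classifiers, `score` returns the mean accuracy (the fraction of correctly classified samples). A small check on the same kind of synthetic `make_blobs` data; the `_demo` names are used only to keep this sketch separate from the notebook variables.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_blobs(centers=2, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=1, stratify=y_demo)

clf_demo = LogisticRegression().fit(X_tr, y_tr)

# score() is accuracy: correct predictions divided by total predictions
manual_accuracy = (clf_demo.predict(X_te) == y_te).mean()
print(clf_demo.score(X_te, y_te))
print(accuracy_score(y_te, clf_demo.predict(X_te)))
print(manual_accuracy)
```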

Make predictions

In [None]:
# Generate a new data point (the red circle)
import numpy as np
new_data = np.array([[0, 6]])
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.scatter(new_data[0, 0], new_data[0, 1], c="r", marker="o", s=100)

In [None]:
# Predict the class (purple or yellow) of the new data point
predictions = classifier.predict(new_data)
print("Classes are either 0 (purple) or 1 (yellow)")
print(f"The new point was classified as: {predictions}")

In [None]:
predictions = classifier.predict(X_test)
pd.DataFrame({"Prediction": predictions, "Actual": y_test})
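
The hard 0/1 predictions come from thresholding a probability. As a sketch on the same kind of synthetic data, `predict_proba` exposes those probabilities, and for a binary `LogisticRegression` they match the sigmoid of the linear score returned by `decision_function`. The `_demo` names are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_blobs(centers=2, random_state=42)
clf_demo = LogisticRegression().fit(X_demo, y_demo)

point = np.array([[0, 6]])

# Column 0 is P(class 0), column 1 is P(class 1); each row sums to 1
probabilities = clf_demo.predict_proba(point)

# The same probability, computed by hand from the linear output
linear_score = clf_demo.decision_function(point)
by_hand = 1 / (1 + np.exp(-linear_score))

print(probabilities)
print(by_hand)
```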

# Students Turn Activity 2 Voice Gender Recognition

* In this activity, you will apply logistic regression to predict the gender of a voice using acoustic properties of the voice and speech.

## Instructions

* Split your data into training and testing data.

* Create a logistic regression model with sklearn.

* Fit the model to the training data.

* Make 10 predictions and compare those to the testing data labels.

* Compute the accuracy score for the training and testing data separately.

- - -


# Voice Gender
Gender Recognition by Voice and Speech Analysis

This database was created to identify a voice as male or female, based upon acoustic properties of the voice and speech. The dataset consists of 3,168 recorded voice samples, collected from male and female speakers. The voice samples are pre-processed by acoustic analysis in R using the seewave and tuneR packages, with an analyzed frequency range of 0hz-280hz (human vocal range).

## The Dataset
The following acoustic properties of each voice are measured and included within the CSV:

* meanfreq: mean frequency (in kHz)
* sd: standard deviation of frequency
* median: median frequency (in kHz)
* Q25: first quartile (in kHz)
* Q75: third quartile (in kHz)
* IQR: interquartile range (in kHz)
* skew: skewness (see note in specprop description)
* kurt: kurtosis (see note in specprop description)
* sp.ent: spectral entropy
* sfm: spectral flatness
* mode: mode frequency
* centroid: frequency centroid (see specprop)
* peakf: peak frequency (frequency with highest energy)
* meanfun: average of fundamental frequency measured across acoustic signal
* minfun: minimum fundamental frequency measured across acoustic signal
* maxfun: maximum fundamental frequency measured across acoustic signal
* meandom: average of dominant frequency measured across acoustic signal
* mindom: minimum of dominant frequency measured across acoustic signal
* maxdom: maximum of dominant frequency measured across acoustic signal
* dfrange: range of dominant frequency measured across acoustic signal
* modindx: modulation index. Calculated as the accumulated absolute difference between adjacent measurements of fundamental frequencies divided by the frequency range
* label: male or female

* Logistic regression is used to predict categories or labels.



In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import os

In [None]:
voice = pd.read_csv(os.path.join('Resources', 'voice.csv'))
voice.head()

In [None]:
# Assign X (data) and y (target)

X = voice.drop("label", axis=1)
y = voice["label"]
print(X.shape, y.shape)
print(X)
print(y)

In [None]:
Split our data into training and testing

In [None]:
# Split the data using train_test_split
# YOUR CODE HERE

In [None]:
Create a Logistic Regression Model

In [None]:
# Create a logistic regression model
# YOUR CODE HERE

In [None]:
Fit (train) our model using the training data

In [None]:
# Fit the model to the data
# YOUR CODE HERE

In [None]:
Validate the model using the test data

In [None]:
# Print the accuracy score for the test data
# YOUR CODE HERE

In [None]:
Make predictions

In [None]:
# Make predictions using the X_test and y_test data
# Print at least 10 predictions vs their actual labels
# YOUR CODE HERE

# Explanations: 
  * We use the `stratify` parameter in `train_test_split` to obtain a representative sample of each category in our test data.
    ![stratify.png](Images/stratify.png)

  * We will apply logistic regression to our dataset in order to predict the label `male` or `female`.

    ![gender-predictions.png](Images/gender-predictions.png)
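
To see what `stratify` does, the sketch below splits a deliberately imbalanced label array with and without it. The 90/10 toy labels are made up for illustration and are not the voice data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 90 samples of class 0 and 10 samples of class 1
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 90 + [1] * 10)

# stratify=y_toy keeps the 90/10 class ratio in both splits
_, _, _, y_test_strat = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1, stratify=y_toy)

# Without stratify, the ratio in the test set is left to chance
_, _, _, y_test_plain = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1)

print("stratified test counts:  ", np.bincount(y_test_strat, minlength=2))
print("unstratified test counts:", np.bincount(y_test_plain, minlength=2))
```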

# Instructor Turn Activity 3 Trees

* Decision Trees encode a series of True/False questions that can be interpreted as if-else statements.

    ![decision-tree.png](Images/decision-tree.png)

    ![dtree-ifelse.png](Images/dtree-ifelse.png)

  * Decision trees have a depth: the number of `if-else` statements encountered before making a decision.

  * Decision trees can become very complex and very deep, depending on how many questions have to be answered. Deep and complex trees tend to overfit to the data and do not generalize well.

    ![tree.png](Images/tree.png)

* Random Forests:

  * Instead of one large, complex tree, you use many small and simple decision trees and average their outputs.

  * These simple trees are created by randomly sampling the data and creating a decision tree for only that small portion of data. This is known as a **weak classifier** because it is only trained on a small piece of the original data and by itself is only slightly better than a random guess. However, many "slightly better than random" small decision trees can be combined to create a **strong classifier**, which has much better decision making power.

  * Another benefit to this algorithm is that it is robust against overfitting. This is because all of those weak classifiers are trained on different pieces of the data.

    ![random-forest.png](Images/random-forest.png)

* Each node in the tree attempts to split the data based on some criteria of the input data. The top of the tree will be the decision point that makes the biggest split. Each sub-node makes a finer and finer grain decision as the depth increases.

  ![iris.png](Images/iris.png)

* Point out that the training phase of the decision tree algorithm learns which features best split the data.

* Explain that a byproduct of the Random Forest algorithm is a ranking of feature importance (i.e. which features have the most impact on the decision).

* Scikit-Learn provides an attribute called `feature_importances_`, where you can see which features were the most significant.

  ```python
  sorted(zip(rf.feature_importances_, iris.feature_names), reverse=True)
  ```
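
The if-else view of a tree and the notion of depth can be made concrete with `export_text`, which prints a fitted tree as nested rules. A minimal sketch on the Iris data, assuming `max_depth=2` purely to keep the output short:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# Limit the depth so the tree stays small and readable
shallow_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
shallow_tree.fit(iris.data, iris.target)

# Each indented line is one True/False question on a single feature
print(export_text(shallow_tree, feature_names=list(iris.feature_names)))
print("depth:", shallow_tree.get_depth())

# An unrestricted tree keeps splitting until the training data is fit
deep_tree = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)
print("unrestricted depth:", deep_tree.get_depth())
```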

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

In [None]:
# Load the Iris Dataset
iris = load_iris()
print(iris.DESCR)

In [None]:
# Create a random forest classifier
rf = RandomForestClassifier(n_estimators=200)
rf = rf.fit(iris.data, iris.target)
rf.score(iris.data, iris.target)

In [None]:
# Random Forests in sklearn will automatically calculate feature importance
importances = rf.feature_importances_
importances

In [None]:
# We can sort the features by their importance
sorted(zip(rf.feature_importances_, iris.feature_names), reverse=True)
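
The `rf.score` call above is computed on the same data the forest was trained on, so it says little about generalization. One hedged way to compare a single tree with a forest is cross-validation; the sketch below assumes 5 folds and fixed `random_state` values for repeatability, and the exact numbers will vary by dataset.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

single_tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0)

# Each model is trained on 4 folds and scored on the held-out fold
tree_scores = cross_val_score(single_tree, iris.data, iris.target, cv=5)
forest_scores = cross_val_score(forest, iris.data, iris.target, cv=5)

print(f"Decision tree mean accuracy: {tree_scores.mean():.3f}")
print(f"Random forest mean accuracy: {forest_scores.mean():.3f}")
```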

# Students Turn Activity # Trees

* In this activity, you will compare the performance of a decision tree to a random forest classifier using the Pima Diabetes DataSet.

## Instructions

* Use the Pima Diabetes DataSet and train a decision tree classifier to predict the diabetes label (positive or negative). Print the score for the trained model using the test data.

* Repeat the exercise using a Random Forest Classifier with SciKit-Learn. You will need to investigate the SciKit-Learn documentation to determine how to build and train this model.

* Experiment with different numbers of estimators in your random forest model. Try different values between 100 and 1000 and compare the scores.

- - -



In [None]:
from sklearn import tree
import pandas as pd
import os

In [None]:
df = pd.read_csv(os.path.join("Resources", "diabetes.csv"))
df.head()

In [None]:
target = df["Outcome"]
target_names = ["negative", "positive"]

In [None]:
data = df.drop("Outcome", axis=1)
feature_names = data.columns
data.head()

In [None]:
# Split the data using train_test_split
# YOUR CODE HERE

In [None]:
# Create a Decision Tree Classifier
# YOUR CODE HERE

In [None]:
# Fit the classifier to the data
# YOUR CODE HERE

In [None]:
# Calculate the accuracy score for the test data
# YOUR CODE HERE

In [None]:
# Bonus
# Create, fit, and score a Random Forest Classifier
# YOUR CODE HERE

  * The accuracy improves slightly when using a random forest classifier. Change the number of estimators in the random forest model and re-compute the score to show how it changes.

    ![nestimators.png](Images/nestimators.png)


# Instructor Turn Activity KNN 

* The KNN algorithm is a simple, yet robust machine learning algorithm. It can be used for both regression and classification. However, it is typically used for classification.

  * Walk through the examples provided and show how `k` changes the classification. We use odd numbers for `k` so that there isn't a tie between neighboring points.

    ![k1.png](Images/k1.png)

    ![k3.png](Images/k3.png)

    ![k5.png](Images/k5.png)

    ![k7.png](Images/k7.png)

  * Finally, the `k` for KNN is often calculated computationally with a loop.

  * Point out that the best `k` value for this dataset is where the score is both accurate and has started to stabilize.

    ![knn-scores.png](Images/knn-scores.png)

    ![knn-plot.png](Images/knn-plot.png)
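
To make the effect of `k` concrete, here is a small from-scratch sketch of the voting step using Euclidean distance. The six training points and the query point are made up for illustration; the notebook itself uses scikit-learn's `KNeighborsClassifier`.

```python
from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - query, axis=1)
    nearest = np.argsort(distances)[:k]
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Hypothetical toy data: one "red" point close to the query, "blue" points farther out
X_toy = np.array([[1.0, 1.0], [3.0, 3.0], [3.0, 4.0],
                  [4.0, 3.0], [0.0, 5.0], [5.0, 0.0]])
y_toy = np.array(["red", "blue", "blue", "blue", "red", "red"])
query = np.array([1.5, 1.5])

for k in (1, 3, 5):
    print(k, knn_predict(X_toy, y_toy, query, k))
```

With `k=1` the single closest point decides the label; with larger `k` the surrounding majority can outvote it.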

In [None]:
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

In [None]:
iris = load_iris()
print(iris.DESCR)

In [None]:
X = iris.data
y = iris.target

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

In [None]:
from sklearn.preprocessing import StandardScaler

# Create a StandardScaler model and fit it to the training data
# Fit on the 2D feature array so each feature gets its own mean and standard deviation

X_scaler = StandardScaler().fit(X_train)

In [None]:
# Transform the training and testing data using the X_scaler model

X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
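
A quick check of what the scaler does: `StandardScaler` learns one mean and one standard deviation per feature column from the training data and reuses them for the test data. A self-contained sketch on Iris with the same split settings:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=42, stratify=iris.target)

# Fit on the 2D training array so each of the 4 features gets its own statistics
X_scaler = StandardScaler().fit(X_train)
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)

print(X_scaler.mean_.shape)                  # one mean per feature
print(X_train_scaled.mean(axis=0).round(6))  # approximately 0 per column
print(X_train_scaled.std(axis=0).round(6))   # approximately 1 per column
```

The test data is transformed with the training statistics, so its column means are close to, but not exactly, zero.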

# K Nearest Neighbors

In [None]:
# Loop through different k values to see which has the highest accuracy
# Note: We only use odd numbers because we don't want any ties
train_scores = []
test_scores = []
for k in range(1, 20, 2):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_scaled, y_train)
    train_score = knn.score(X_train_scaled, y_train)
    test_score = knn.score(X_test_scaled, y_test)
    train_scores.append(train_score)
    test_scores.append(test_score)
    print(f"k: {k}, Train/Test Score: {train_score:.3f}/{test_score:.3f}")
    
    
plt.plot(range(1, 20, 2), train_scores, marker='o', label='training')
plt.plot(range(1, 20, 2), test_scores, marker="x", label='testing')
plt.legend()
plt.xlabel("k neighbors")
plt.ylabel("Accuracy score")
plt.show()

In [None]:
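# Note that k: 9 provides the best accuracy where the classifier starts to stabilize
# Fit and score on the scaled data, consistent with the loop above
knn = KNeighborsClassifier(n_neighbors=9)
knn.fit(X_train_scaled, y_train)
print('k=9 Test Acc: %.3f' % knn.score(X_test_scaled, y_test))

In [None]:
# New samples must be scaled with the same scaler before predicting
new_iris_data = [[4.3, 3.2, 1.3, 0.2]]
predicted_class = knn.predict(X_scaler.transform(new_iris_data))
print(predicted_class)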

# Students Turn KNN Activity 

* In this activity, you will determine the best `k` value in KNN to predict diabetes for the Pima Diabetes DataSet.

## Instructions

* Calculate the Train and Test scores for `k` ranging from 1 to 20. Use only odd numbers for the k values.

* Plot the `k` values for both the train and test data to determine where the best combination of scores occur. This point will be the optimal `k` value for your model.

* Re-train your model using the `k` value that you found to have the best scores. Print the score for this value.

- - -


In [None]:
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
import pandas as pd
import os

In [None]:
df = pd.read_csv(os.path.join("Resources", "diabetes 2.csv"))
df.head()

In [None]:
target = df["Outcome"]
target_names = ["negative", "positive"]

In [None]:
data = df.drop("Outcome", axis=1)
feature_names = data.columns
data.head()

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data, target, random_state=42)

In [None]:
# Loop through different k values to see which has the highest accuracy
# Note: We only use odd numbers because we don't want any ties
# YOUR CODE HERE


    
plt.plot(range(1, 20, 2), train_scores, marker='o')
plt.plot(range(1, 20, 2), test_scores, marker="x")
plt.xlabel("k neighbors")
plt.ylabel("Testing accuracy Score")
plt.show()

In [None]:
# Choose the best k from above and re-fit the KNN Classifier using that k value.
# print the score for the test data
# YOUR CODE HERE

* For this activity, `K=13` seems to be the best combination of both the train and test scores.

    ![knn-train-test.png](Images/knn-train-test.png)

* Ask students for any additional questions before moving on.


# Instructor Turn Activity 7 SVM

  * The goal of a linear classifier is to find a line that separates two groups of classes. However, there may be many options for choosing this line and each boundary could result in misclassification of new data.

    ![linear-discriminative-classifiers.png](Images/linear-discriminative-classifiers.png)

    ![classifier-boundaries.png](Images/classifier-boundaries.png)

  * SVMs try to find a hyperplane that maximizes the margin between groups. This is like building a virtual wall between groups where you want the wall to be as thick as possible.

    ![svm-hyperplane.png](Images/svm-hyperplane.png)


  * There are different kernels available for the SVM model in SciKit-Learn, but we are going to use the linear model in this example.

    ![svm-linear.png](Images/svm-linear.png)

  * Plot the decision boundaries for the trained model. The plot shows how the algorithm maximized the margin between the two groups.

    ![svm-boundary-plot.png](Images/svm-boundary-plot.png)

  * Next, show an example of "real" data where the boundaries are overlapping. In this case, the SVM algorithm will "soften" the margins and allow some of the data points to cross over the margin boundaries in order to obtain a fit.

    ![svm-soften.png](Images/svm-soften.png)

  * Generate a classification report to quantify and validate the model performance.

    ![svm-report.png](Images/svm-report.png)

  * Review precision and recall to dig into the meaning behind each score.

![svm1.png](Images/SVM1.png)
![svm2.png](Images/SVM2.png)
![svm3.png](Images/SVM3.png)
![svm4.png](Images/SVM4.png)
![svm5.png](Images/SVM5.png)
![svm6.png](Images/SVM6.png)
![svm7.png](Images/SVM7.png)
![svm8.png](Images/SVM8.png)
![svm9.png](Images/SVM9.png)
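
The "thick wall" picture can be checked numerically. For a linear SVM the decision function is `w · x + b`, the margin edges sit where it equals ±1, and the wall's thickness is `2 / ||w||`. A minimal sketch, assuming synthetic `make_blobs` data; the `_demo` names are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X_demo, y_demo = make_blobs(n_samples=40, centers=2, random_state=42,
                            cluster_std=1.25)
svm_demo = SVC(kernel="linear").fit(X_demo, y_demo)

w = svm_demo.coef_[0]        # normal vector of the separating hyperplane
b = svm_demo.intercept_[0]
margin_width = 2 / np.linalg.norm(w)

# Support vectors are the training points that pin down the margin
print("support vectors:\n", svm_demo.support_vectors_)
print("margin width:", margin_width)

# decision_function is the linear score w . x + b
scores = X_demo @ w + b
print(np.allclose(scores, svm_demo.decision_function(X_demo)))
```

Note that `coef_` is only available when `kernel='linear'`.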

In [None]:
import numpy as np
import matplotlib.pyplot as plt

from matplotlib import style
style.use("ggplot")
# from matplotlib import rcParams
# rcParams['figure.figsize'] = 10, 8

In [None]:
# make_blobs is imported from sklearn.datasets (the samples_generator module was removed)
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=40, centers=2, random_state=42, cluster_std=1.25)
plt.scatter(X[:, 0], X[:, 1], c=y, s=100, cmap="bwr");
plt.show()

In [None]:
# Support vector machine linear classifier
from sklearn.svm import SVC 
model = SVC(kernel='linear')
model.fit(X, y)

In [None]:
# WARNING! BOILERPLATE CODE HERE!
# Plot the decision boundaries
x_min = X[:, 0].min()
x_max = X[:, 0].max()
y_min = X[:, 1].min()
y_max = X[:, 1].max()

XX, YY = np.mgrid[x_min:x_max, y_min:y_max]
Z = model.decision_function(np.c_[XX.ravel(), YY.ravel()])

# Put the result into a color plot
Z = Z.reshape(XX.shape)
# plt.pcolormesh(XX, YY, Z > 0, cmap=plt.cm.Paired)
plt.contour(XX, YY, Z, colors=['k', 'k', 'k'],
            linestyles=['--', '-', '--'], levels=[-.5, 0, .5])
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='bwr', edgecolor='k', s=100)
plt.show()

# Validation

In [None]:
X, y = make_blobs(n_samples=100, centers=2, random_state=0, cluster_std=.95)
plt.scatter(X[:, 0], X[:, 1], c=y, s=100, cmap="bwr");
plt.show()

In [None]:
# Split data into training and testing
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [None]:
# Fit to the training data and validate with the test data
model = SVC(kernel='linear')
model.fit(X_train, y_train)
predictions = model.predict(X_test)

In [None]:
# Plot the decision boundaries
x_min = X[:, 0].min()
x_max = X[:, 0].max()
y_min = X[:, 1].min()
y_max = X[:, 1].max()

XX, YY = np.mgrid[x_min:x_max, y_min:y_max]
Z = model.decision_function(np.c_[XX.ravel(), YY.ravel()])

# Put the result into a color plot
Z = Z.reshape(XX.shape)
# plt.pcolormesh(XX, YY, Z > 0, cmap=plt.cm.Paired)
plt.contour(XX, YY, Z, colors=['k', 'k', 'k'],
            linestyles=['--', '-', '--'], levels=[-.5, 0, .5])
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='bwr', edgecolor='k', s=100)
plt.show()

In [None]:
# Calculate classification report
from sklearn.metrics import classification_report
print(classification_report(y_test, predictions,
                            target_names=["blue", "red"]))
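
The numbers in a classification report can be reproduced from raw counts. The sketch below uses made-up labels (not the model output above) to show how precision, recall, and F1 for the positive class follow from true positives, false positives, and false negatives.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels for illustration: 1 = "red", 0 = "blue"
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))  # predicted red, actually red
fp = np.sum((y_pred == 1) & (y_true == 0))  # predicted red, actually blue
fn = np.sum((y_pred == 0) & (y_true == 1))  # predicted blue, actually red

precision = tp / (tp + fp)  # of the points predicted red, how many were red
recall = tp / (tp + fn)     # of the truly red points, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
print(f1, f1_score(y_true, y_pred))
```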

# Students Turn Activity 8 SVM

* In this activity, apply a support vector machine classifier to predict diabetes for the Pima Diabetes DataSet.

## Instructions

* Import a support vector machine linear classifier and fit the model to the data.

* Compute the classification report for this model using the test data.

- - -

In [None]:
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
import pandas as pd
import numpy as np
import os

In [None]:
df = pd.read_csv(os.path.join("Resources", "diabetes.csv"))
df.head()

In [None]:
target = df["Outcome"]
target_names = ["negative", "positive"]

In [None]:
data = df.drop("Outcome", axis=1)
feature_names = data.columns
data.head()

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data, target, random_state=42)

In [None]:
# Create a support vector machine linear classifier and fit it to the training data
# YOUR CODE HERE

In [None]:
# Print the model score using the test data
# YOUR CODE HERE

In [None]:
# Calculate the classification report
# YOUR CODE HERE

  * The F1 scores indicate that this model is slightly more accurate at predicting negative cases of diabetes than positive cases.

    ![svm-f1.png](Images/svm-f1.png)

# Instructor Turn Activity 9 GridSearch
* Walk through the code for hyperparameter tuning with `GridSearchCV`.

  * Show the SVM model to highlight the different parameters available for the model. Each of these parameters can be adjusted and tweaked to improve model performance.

    ![svm-model.png](Images/svm-model.png)

  * In machine learning, there are few if any general rules on how to adjust these parameters. Instead, machine learning practitioners often use a brute-force approach where they try different combinations of values to see which has the best performance. This is known as `hyperparameter tuning`.

  * To simplify the hyperparameter tuning process, SciKit-Learn provides a tool called `GridSearchCV`. This class is known as a `meta-estimator`. That is, it takes a model and a dictionary of parameter settings and tests all combinations of parameter settings to see which settings have the best performance.

    ![grid-model.png](Images/grid-model.png)

    ![grid-fit.png](Images/grid-fit.png)
    
    ![C2](Images/C2.png)
    ![C1](Images/C1.png)
    ![C3](Images/C3.png)

  * Once the model has been trained, the best parameters can be accessed through the `best_params_` attribute.

    ![grid-best-params.png](Images/grid-best-params.png)

  * Similarly, the best score can be accessed through the `best_score_` attribute.

    ![grid-best-score.png](Images/grid-best-score.png)

  * The grid meta-estimator basically wraps the original model, so you can access the model functions like `predict`.

    ![grid-predict.png](Images/grid-predict.png)
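
What "tests all combinations" means can be seen with `ParameterGrid`, the helper scikit-learn provides for enumerating a grid. A small sketch; the grid values mirror the ones used in the cells below, and the fold count of 5 is an assumption (it is the default in recent scikit-learn releases).

```python
from sklearn.model_selection import ParameterGrid

param_grid = {'C': [1, 5, 10, 50],
              'gamma': [0.0001, 0.0005, 0.001, 0.005]}

combinations = list(ParameterGrid(param_grid))
print(len(combinations))   # 4 values of C x 4 values of gamma
print(combinations[:3])

# Each combination is fit once per cross-validation fold
cv_folds = 5  # assumed fold count
print("total fits:", len(combinations) * cv_folds)
```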


In [None]:
import numpy as np
import matplotlib.pyplot as plt

from matplotlib import style
style.use("ggplot")
# from matplotlib import rcParams
# rcParams['figure.figsize'] = 10, 8

In [None]:
# make_blobs is imported from sklearn.datasets (the samples_generator module was removed)
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=100, centers=2, random_state=0, cluster_std=.95)
plt.scatter(X[:, 0], X[:, 1], c=y, s=100, cmap="bwr");
plt.show()

In [None]:
# Split data into training and testing
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [None]:
# Create the SVC Model
from sklearn.svm import SVC 
model = SVC(kernel='linear')
model

In [None]:
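# Create the GridSearch estimator along with a parameter object containing the values to adjust
# Note: `gamma` only affects the 'rbf', 'poly', and 'sigmoid' kernels;
# with kernel='linear', changing it does not change the fitted model
from sklearn.model_selection import GridSearchCV
param_grid = {'C': [1, 5, 10, 50],
              'gamma': [0.0001, 0.0005, 0.001, 0.005]}
grid = GridSearchCV(model, param_grid, verbose=3)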

In [None]:
# Fit the model using the grid search estimator. 
# This will take the SVC model and try each combination of parameters
grid.fit(X_train, y_train)

In [None]:
# List the best parameters for this dataset
print(grid.best_params_)

In [None]:
# List the best score
print(grid)
print(grid.best_score_)

In [None]:
# Make predictions with the hypertuned model
predictions = grid.predict(X_test)

In [None]:
# Calculate classification report
from sklearn.metrics import classification_report
print(classification_report(y_test, predictions,
                            target_names=["blue", "red"]))
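
Beyond `best_params_` and `best_score_`, the fitted grid keeps a full record of every combination in `cv_results_`. The sketch below is a hedged variation, not the notebook's model: it assumes an RBF kernel so that both `C` and `gamma` influence the fit, and it picks illustrative grid values and `cv=5`.

```python
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X_demo, y_demo = make_blobs(n_samples=100, centers=2, random_state=0,
                            cluster_std=.95)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# An RBF kernel is assumed here so that both C and gamma matter
rbf_grid = GridSearchCV(SVC(kernel="rbf"),
                        {"C": [1, 5, 10], "gamma": [0.01, 0.1, 1]},
                        cv=5)
rbf_grid.fit(X_tr, y_tr)

# cv_results_ holds one row per parameter combination
results = pd.DataFrame(rbf_grid.cv_results_)
print(results[["param_C", "param_gamma", "mean_test_score", "rank_test_score"]])
print(rbf_grid.best_params_, rbf_grid.best_score_)
```

Sorting `results` by `rank_test_score` is a convenient way to see how close the runner-up settings are to the best one.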

# Student Turn Activity 10 Grid Search and Hyper-Parameter Tuning

* In this activity, you will revisit the SVM activity for the Pima Diabetes DataSet and apply `GridSearchCV` to tune the model parameters.

## Instructions

* Use the starter code provided and apply `GridSearchCV` to the model. Change the `C` and `gamma` parameters.

* For `C`, use the following list of settings: `[1, 5, 10]`.

* For `gamma`, use the following list of settings: `[0.0001, 0.001, 0.01]`.

* Print the best parameters and best score for the tuned model.

* Calculate predictions using the `X_test` data and print the classification report.

- - -

In [None]:
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
import pandas as pd
import numpy as np
import os

In [None]:
df = pd.read_csv(os.path.join("Resources", "diabetes.csv"))
df.head()

In [None]:
target = df["Outcome"]
target_names = ["negative", "positive"]

In [None]:
data = df.drop("Outcome", axis=1)
feature_names = data.columns
data.head()

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data, target, random_state=42)

In [None]:
# Support vector machine linear classifier
from sklearn.svm import SVC 
model = SVC(kernel='linear')

In [None]:
# Create the GridSearch estimator along with a parameter object containing the values to adjust
# Try adjusting `C` with values of 1, 5, and 10. Adjust `gamma` using .0001, 0.001, and 0.01
# YOUR CODE HERE

In [None]:
# Fit the model using the grid search estimator. 
# This will take the SVC model and try each combination of parameters
# YOUR CODE HERE

In [None]:
# List the best parameters for this dataset
# YOUR CODE HERE

In [None]:
# List the best score
# YOUR CODE HERE

In [None]:
# Make predictions with the hypertuned model
# YOUR CODE HERE

In [None]:
# Calculate classification report
# YOUR CODE HERE

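* The grid search tested 9 parameter combinations (3 values of `C` × 3 values of `gamma`), fitting each one once per cross-validation fold. That is 27 fits with 3-fold cross-validation, or 45 fits with the 5-fold default used by recent scikit-learn versions.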
* Applying GridSearch saves us considerable time vs manually changing these values ourselves.

  *  Knowing which parameters to tune and which values to use comes from both experience and Sklearn's documentation for their algorithms.

  * By simply tuning two of our hyperparameters, the model score increased from 0.729 to an accuracy score of 0.774!

    ![grid-score-diabetes.png](Images/grid-score-diabetes.png)

- - -