<h3>k-Nearest Neighbors: Fit</h3>
Having explored the Congressional voting records dataset, it is time now to build your first classifier. In this exercise, you will fit a k-Nearest Neighbors classifier to the voting dataset, which has once again been pre-loaded for you into a DataFrame df.

In the video, Hugo discussed the importance of ensuring your data adheres to the format required by the scikit-learn API. The features need to be in an array where each column is a feature and each row a different observation or data point - in this case, a Congressman's voting record. The target needs to be a single column with the same number of observations as the feature data. We have done this for you in this exercise. Notice we named the feature array X and response variable y: This is in accordance with the common scikit-learn practice.

Your job is to create an instance of a k-NN classifier with 6 neighbors (by specifying the n_neighbors parameter) and then fit it to the data. The data has been pre-loaded into a DataFrame called df.
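
A minimal sketch of this step follows. The pre-loaded df is not reproduced here, so the block builds a random stand-in with 435 rows. The column names (party, vote_1 to vote_16), the number of vote columns, and the 0/1 encoding are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical stand-in for the pre-loaded voting DataFrame `df`:
# 435 rows, 16 assumed yes/no votes encoded as 1/0, plus a 'party' label.
rng = np.random.default_rng(0)
votes = rng.integers(0, 2, size=(435, 16))
df = pd.DataFrame(votes, columns=[f"vote_{i}" for i in range(1, 17)])
df["party"] = rng.choice(["democrat", "republican"], size=435)

# Feature array X (one row per observation, one column per feature)
# and target array y (one label per observation)
X = df.drop("party", axis=1).values
y = df["party"].values

# Create a k-NN classifier with 6 neighbors and fit it to the data
knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(X, y)
```

With the real df, only the stand-in lines would be dropped; the X, y, and fit lines stay the same as long as the label column is named as assumed.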



<h3>k-Nearest Neighbors: Predict</h3>
Having fit a k-NN classifier, you can now use it to predict the label of a new data point. However, there is no unlabeled data available since all of it was used to fit the model! You can still use the .predict() method on the X that was used to fit the model, but it is not a good indicator of the model's ability to generalize to new, unseen data.

In the next video, Hugo will discuss a solution to this problem. For now, a random unlabeled data point has been generated and is available to you as X_new. You will use your classifier to predict the label for this new data point, as well as on the training data X that the model has already seen. Using .predict() on X_new will generate 1 prediction, while using it on X will generate 435 predictions: 1 for each sample.

The DataFrame has been pre-loaded as df. This time, you will create the feature array X and target variable array y yourself.
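
The following sketch shows one way to carry out this step. As before, df and X_new are random stand-ins, and the column names, the 16 vote columns, and the choice of 6 neighbors (carried over from the previous exercise) are assumptions rather than part of the exercise data.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical stand-in for the pre-loaded DataFrame `df` (435 observations)
rng = np.random.default_rng(0)
votes = rng.integers(0, 2, size=(435, 16))
df = pd.DataFrame(votes, columns=[f"vote_{i}" for i in range(1, 17)])
df["party"] = rng.choice(["democrat", "republican"], size=435)

# Create the feature array X and the target array y
X = df.drop("party", axis=1).values
y = df["party"].values

# Fit a k-NN classifier (6 neighbors is an assumption)
knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(X, y)

# Hypothetical unlabeled data point with the same 16 features
X_new = rng.integers(0, 2, size=(1, 16))

# Predict on the training data: one prediction per sample
y_pred = knn.predict(X)

# Predict on the new data point: a single prediction
new_prediction = knn.predict(X_new)
print("Prediction: {}".format(new_prediction))
```

Because the stand-in labels are random, the predicted label itself carries no meaning here; the point is the shape of the outputs.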



Great work! Did your model predict 'democrat' or 'republican'? How sure can you be of its predictions? In other words, how can you measure its performance? This is what you will learn in the next video.



<h2>Measuring Model Performance</h2>

<h3>How to Measure Performance</h3>

<ul>
    <li>In classification, accuracy is a commonly used metric</li>
    <li>Accuracy = fraction of correct predictions</li>
    <li>Which data should be used to compute accuracy?</li>
    <li>How well will the model perform on new data?</li>
    <li>Could compute accuracy on the data used to fit the classifier</li>
    <li>Not indicative of the ability to generalize</li>
    <li>Split the data into a training set and a test set</li>
    <li>Fit/train the classifier on the training set</li>
    <li>Make predictions on the test set</li>
    <li>Compare the predictions with the known labels</li>
</ul>
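
The steps above can be sketched as follows. This is only an illustration: it uses scikit-learn's built-in Iris data, and the test size (0.3), random state (21), and number of neighbors (8) are arbitrary assumptions. Accuracy is computed by hand as the fraction of correct predictions and then compared with the .score() method.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load features and labels (Iris is used purely as an example dataset)
X, y = load_iris(return_X_y=True)

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=21, stratify=y
)

# Fit the classifier on the training set only
knn = KNeighborsClassifier(n_neighbors=8)
knn.fit(X_train, y_train)

# Make predictions on the test set
y_pred = knn.predict(X_test)

# Accuracy = fraction of predictions that match the known labels
accuracy = np.mean(y_pred == y_test)
print(accuracy)

# .score() reports the same quantity for a classifier
print(knn.score(X_test, y_test))
```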


<h4>Model Complexity</h4>
<ul>
    <li>Larger k = smoother decision boundary = less complex model</li>
    <li>Smaller k = more complex model = can lead to overfitting</li>
</ul>

<h2>The digits recognition dataset</h2><br>
Up until now, you have been performing binary classification, since the target variable had two possible outcomes. Hugo, however, got to perform multi-class classification in the videos, where the target variable could take on three possible outcomes. Why does he get to have all the fun?! In the following exercises, you'll be working with the MNIST digits recognition dataset, which has 10 classes, the digits 0 through 9! A reduced version of the MNIST dataset is one of scikit-learn's included datasets, and that is the one we will use in this exercise.<br><br>

Each sample in this scikit-learn dataset is an 8x8 image representing a handwritten digit. Each pixel is represented by an integer in the range 0 to 16, indicating varying levels of black. Recall that scikit-learn's built-in datasets are of type Bunch, which are dictionary-like objects. Helpfully for the MNIST dataset, scikit-learn provides an 'images' key in addition to the 'data' and 'target' keys that you have seen with the Iris data. Because it is a 2D array of the images corresponding to each sample, this 'images' key is useful for visualizing the images, as you'll see in this exercise (for more on plotting 2D arrays, see Chapter 2 of DataCamp's course on Data Visualization with Python). On the other hand, the 'data' key contains the feature array - that is, the images as a flattened array of 64 pixels.<br><br>

Notice that you can access the keys of these Bunch objects in two different ways: By using the . notation, as in digits.images, or the [] notation, as in digits['images'].<br><br>

For more on the MNIST data, check out this exercise in Part 1 of DataCamp's Importing Data in Python course. There, the full version of the MNIST dataset is used, in which the images are 28x28. It is a famous dataset in machine learning and computer vision, and frequently used as a benchmark to evaluate the performance of a new model.<br>

<h3>Instruction</h3>
<ul>
    <li>Import datasets from sklearn and matplotlib.pyplot as plt.</li>
    <li>Load the digits dataset using the .load_digits() method on datasets.</li>
    <li>Print the keys and DESCR of digits.</li>
    <li>Print the shape of images and data keys using the . notation.</li>
    <li>Display the 1011th image using plt.imshow(). This has been done for you, so hit 'Submit Answer' to see which handwritten digit this happens to be!</li>
</ul>
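
A minimal sketch of these instructions is shown below. The colormap and interpolation arguments passed to plt.imshow() are assumptions chosen for readability, not something the exercise prescribes.

```python
import matplotlib.pyplot as plt
from sklearn import datasets

# Load the digits dataset (a dictionary-like Bunch object)
digits = datasets.load_digits()

# Print the keys and the description of the dataset
print(digits.keys())
print(digits.DESCR)

# Shape of the image array and of the flattened feature array
print(digits.images.shape)  # (1797, 8, 8)
print(digits.data.shape)    # (1797, 64)

# Draw the 1011th image (index 1010, since indexing starts at 0).
# In a notebook the figure renders at the end of the cell;
# in a script, call plt.show() afterwards.
plt.imshow(digits.images[1010], cmap=plt.cm.gray_r, interpolation="nearest")
```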

Good job! It looks like the image in question corresponds to the digit '5'. Now, can you build a classifier that can make this prediction not only for this image, but for all the other ones in the dataset? You'll do so in the next exercise!

<h3>Train/Test Split + Fit/Predict/Accuracy</h3>
<br />
Now that you have learned about the importance of splitting your data into training and test sets, it's time to practice doing this on the digits dataset! After creating arrays for the features and target variable, you will split them into training and test sets, fit a k-NN classifier to the training data, and then compute its accuracy using the .score() method.


<h4>Instruction</h4>
<ul>
    <li>Import KNeighborsClassifier from sklearn.neighbors and train_test_split from sklearn.model_selection.</li>
    <li>Create an array for the features using digits.data and an array for the target using digits.target.</li>
    <li>Create stratified training and test sets using 0.2 for the size of the test set. Use a random state of 42. Stratify the split according to the labels so that they are distributed in the training and test sets as they are in the original dataset.</li>
    <li>Create a k-NN classifier with 7 neighbors and fit it to the training data.</li>
    <li>Compute and print the accuracy of the classifier's predictions using the .score() method.</li>
</ul>
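
One way to follow these instructions is sketched below. It is not an official solution; the variable names are simply the conventional ones.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load the digits dataset
digits = datasets.load_digits()

# Create feature and target arrays
X = digits.data
y = digits.target

# Stratified split: 20% of the samples go to the test set,
# with the label proportions preserved in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Create a k-NN classifier with 7 neighbors and fit it to the training data
knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train, y_train)

# Compute and print the accuracy on the held-out test set
accuracy = knn.score(X_test, y_test)
print(accuracy)
```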

Excellent work! Incredibly, this out-of-the-box k-NN classifier with 7 neighbors has learned from the training data and predicted the labels of the images in the test set with 98% accuracy, and it did so in less than a second! This is one illustration of how useful machine learning techniques can be.

<h3>Overfitting and underfitting</h3>
<br>
Remember the model complexity curve that Hugo showed in the video? You will now construct such a curve for the digits dataset! In this exercise, you will compute and plot the training and testing accuracy scores for a variety of different neighbor values. By observing how the accuracy scores differ for the training and testing sets with different values of k, you will develop your intuition for overfitting and underfitting.<br><br>

The training and testing sets are available to you in the workspace as X_train, X_test, y_train, y_test. In addition, KNeighborsClassifier has been imported from sklearn.neighbors.

<h4>Exercise</h4>
<ul>
    <li>Inside the for loop:</li>
    <ul>
        <li>Set up a k-NN classifier with the number of neighbors equal to k.</li>
        <li>Fit the classifier with k neighbors to the training data.</li>
        <li>Compute the accuracy scores on the training set and the test set separately using the .score() method, and assign the results to the train_accuracy and test_accuracy arrays respectively.</li>
    </ul>
</ul>
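
The loop described above might look like the following sketch. The range of neighbor values (1 through 8) and the split parameters are assumptions; in the exercise the training and test sets are already provided in the workspace.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Recreate training and test sets (assumed split: 20% test, stratified)
digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42,
    stratify=digits.target
)

# Neighbor values to try (assumed range) and arrays to hold the scores
neighbors = np.arange(1, 9)
train_accuracy = np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))

for i, k in enumerate(neighbors):
    # Set up a k-NN classifier with k neighbors
    knn = KNeighborsClassifier(n_neighbors=k)

    # Fit the classifier to the training data
    knn.fit(X_train, y_train)

    # Accuracy on the training set and on the test set
    train_accuracy[i] = knn.score(X_train, y_train)
    test_accuracy[i] = knn.score(X_test, y_test)

# Print the scores side by side; plotting them against `neighbors`
# gives the model complexity curve
for k, train_acc, test_acc in zip(neighbors, train_accuracy, test_accuracy):
    print(f"k={k}: train={train_acc:.3f}, test={test_acc:.3f}")
```

A large gap between training and test accuracy at small k suggests overfitting, while both scores dropping together at large k suggests underfitting.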

Great work! It looks like the test accuracy is highest when using 3 and 5 neighbors. Using 8 neighbors or more seems to result in a simple model that underfits the data. Now that you've grasped the fundamentals of classification, you will learn about regression in the next chapter!

