# Assignment 3: 75 points (+ 25 extra credit)
## Exploratory Analysis with Yellowbrick, Confusion Matrix

### IMPORTANT: 
#### You MUST read everything in this notebook CAREFULLY, including ALL code comments.  If you do not, then you may easily make mistakes.

This week we will use a Python package called 'yellowbrick', which has some very pleasing data visualizations.  You will need to install the package in order for Task 1 to work properly.  Be sure to review the class slides if you need to. (But read the comments in the next cell first.)

You may need to consult the following documentation URLs in order to complete this assignment:

https://pypi.org/project/yellowbrick/

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html 





### Important note:

The original version of this notebook used a package called 'beepy', but that package will not work if you are using a Python version higher than 3.7.  Thus, the notebook was updated to use a package called 'playsound'.  You must do the following before running the next code cell:

1. In your anaconda environment that you created for this class, open up a terminal window and run the command: pip3 install PyObjC


2. AFTER you do that, then run: pip3 install playsound

PyObjC is needed because playsound relies on it for efficiency. If you install playsound before installing PyObjC, you will get an error when you try to use playsound, so install them in the order shown above.

3. Optional: If you want to use your OWN sound file instead of yourcodeisdonerunning.m4a, that is fine, but you'll need to record it yourself and make sure that the call to the 'playsound' function refers to it with whatever file name you gave it.

In [1]:
# Task 1: 5 points.  Set up environment

# If some of these do not import properly, you may need to install them and re-run
# For example, the yellowbrick package is available in Anaconda, 
# but it's not installed by default. So either install it into your environment 
# using Navigator's UI, or use a terminal to install it with 'pip3 install yellowbrick' 

import keras
import playsound
import sklearn
import tensorflow
import time

import matplotlib         as mpl   # for graphing
import matplotlib.pyplot  as plt
import numpy              as np    # for fast vector and matrix operations
import pandas             as pd

from keras.datasets          import cifar10  # The Keras package comes with several datasets, incl. CIFAR10
from playsound               import playsound
from pprint                  import pprint   # pprint means 'pretty print'.  You'll see why when we use it.
from sklearn.linear_model    import SGDClassifier, LogisticRegression
from sklearn.metrics         import confusion_matrix, precision_recall_curve, precision_score, recall_score, f1_score
from sklearn.model_selection import cross_val_predict, cross_val_score, GridSearchCV
from yellowbrick.classifier  import ClassBalance, ClassificationReport, ClassPredictionError, ConfusionMatrix, ROCAUC

np.random.seed(42) # for reproducibility
# The next line tells Jupyter to show all plots inside the notebook
%matplotlib inline 

'Done'

'Done'

In [None]:
# Task 2: 5 points

# Load and preprocess the CIFAR-10 dataset using the same variables
# as in Assignment 2, including X_train, X_train_flat, y_train
# as well as the corresponding variables for the test data.
# Don't forget to include np.random.seed(42) in the beginning.
# We defined and used LABEL_NAMES in Assignment 1, but did not use
# it in Assignment 2.  However, we will be using it again here,
# so make sure you copy it from Assignment 1 and add it here.

#################### Insert your code below for 5 points ###############


########################### Your code ends above ###########################

'Done' 

In [None]:
# Function to show randomly selected pictures
# There is no task for you to write in this cell, 
# but it would be useful for you to study the code.

def pictureGrid(rows, cols, labels, picData, picLabels, picPreds=None, predProbs=None, predsFlag=False):
    '''Show random picture grid with labels, optionally with predicted label and its prediction probability.'''
    # rows:       integer number of rows of pictures to display in grid
    # cols:       integer number of pictures per row to display in grid
    # labels:     list of picture labels
    # picData:    matrix of pictures, one flattened picture vector per row
    # picLabels:  ground truth of picture labels for picData
    # picPreds:   predictions of picData from some model (predictions are integer indices of labels)
    # predProbs:  corresponding probabilities of the predictions
    # predsFlag:  boolean to indicate if predictions are included in function call
    
    figure = plt.figure(figsize=(2 * cols - 1, 3 * rows - 1))
    for r in range(rows):
        for c in range(cols):
            randomIndex = np.random.randint(0, len(picLabels))    # get index of random pic 
            ax = figure.add_subplot(rows, cols, c * rows + r + 1) # set up picture grid for display
            ax.grid(False)                                        # hide grid lines (a string like 'off' is not a reliable way to do this)
            ax.axis('off')
            ax.imshow(picData[randomIndex].reshape(32, 32, 3))    # convert flattened pic back to shape (32, 32, 3)
            if predsFlag:
                picLabel  = labels[picPreds[randomIndex]]         # get predicted label of test pic
                predProb  = predProbs[randomIndex]                # get probability of test pic prediction
            gtLabel = labels[picLabels[randomIndex]]              # ground truth label
            if predsFlag:
                ax.set_title("image: {}\npred: {}\nprob: {:.3}\ngt: {}".format(randomIndex, picLabel, predProb, gtLabel))
            else:
                ax.set_title("image: {}\ngt: {}".format(randomIndex, gtLabel))
    plt.show()
    
pictureGrid(4, 5, LABEL_NAMES, X_test_flat, y_test)
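
The reshape step inside pictureGrid can be tried on its own. The following is only a minimal sketch that uses a small random array as a stand-in for real pictures (the names `images`, `images_flat` and `restored` are made up for this illustration and are not the assignment's variables):

```python
import numpy as np

# Hypothetical stand-in for a batch of CIFAR-10-style pictures:
# 4 RGB images of 32 x 32 pixels filled with random values.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(4, 32, 32, 3), dtype=np.uint8)

# Flatten each picture into one row of 32 * 32 * 3 = 3072 values.
images_flat = images.reshape(len(images), -1)
print(images_flat.shape)                      # (4, 3072)

# Reshape one flattened row back to (32, 32, 3) so it could be displayed.
restored = images_flat[2].reshape(32, 32, 3)
print(np.array_equal(restored, images[2]))    # True
```

Because reshape only changes how the same values are indexed, the round trip gives back exactly the original picture.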

In [None]:
# Task 3: 10 points

# Build another SGD Classifier using a different loss function.

# We have already used SGDClassifier in the previous assignments.
# By default, it uses parameter: loss='hinge' but in 
# Assignment 2 we discovered that modified_huber gave us better
# results.

# Another difference is that if we want to access the confidence scores 
# of our models through predict_proba(), they are not available with hinge loss. 
# They are available with 'modified_huber' as our loss function. 

# The confidence scores will tell you how confident the ML algorithm is
# for each of the categories.  We will see this in the next cell.

# You can keep the default values for all other arguments, i.e. you
# won't need to mention any of the others.

# Add your code below to do the following: (2 points each)
# You can look at Assignment 2 to refresh your memory for 
# much of this.
# 1. create an SGDClassifier using a modified_huber loss
#    function, a random state of 42, and use all cores so
#    you don't have to wait too long for this to train.
# 2. Fit your model on the flattened training data with ground truth labels
# 3. Add the code to capture timing and print the elapsed time, as before.
# 4. Print the score (accuracy) of the trained model on the test data.
#    You have already done that in Assignment 1, so go find it if
#    you do not remember.  Add some text in your print statement
#    so we know what we are looking at.
# 5. Notify yourself when this is done with playsound: e.g. 
#    just insert the code: playsound('yourcodeisdonerunning.m4a')

#################### Insert your code below for 10 points,  ###############
####################    distributed as described above      ###############
# Use the variables shown here, but there are additional lines of 
# code that you'll have to add.
startTime  = 

sgdModHub  = 
                          


stopTime   =                                   


########################### Your code ends above ###########################

# In Assignment 2 we were able to get close to 40%
# accuracy, but with this model you will see something 
# significantly worse than that.
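
If you want a reminder of how elapsed time can be captured, here is one possible sketch. It times a plain Python call rather than model training, and the helper name `timed` is invented for this example; it is not something defined elsewhere in this notebook.

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()               # capture the starting time
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start     # seconds since the start
    return result, elapsed

# Time a simple built-in call as a stand-in for a slow training step.
total, seconds = timed(sum, range(1_000_000))
print(f'Result: {total}, elapsed time: {seconds:.4f} seconds')
```

Capturing a start and stop time around the call, as described in the task, works just as well; the helper only packages the same two perf_counter calls.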


In [None]:
# Task 4: 20 points

# We used modified_huber to have access to the confidence values of the predictions.
# Here is where we obtain those values using the predict_proba() method.
# Not all models will provide this method (e.g. LinearSVC), but many do.  
# Knowing that predict_proba is a method should be enough information for you
# to use it correctly, but you can consult the web or the documentation
# for SGDClassifier on sklearn for more info.

testImage = 7916

# 5 points: Get the probabilities of all classes for every test image
# 5 points: Then print those probabilities for the testImage

#################### Insert your code below for 10 points,  ###############

testProbs = 


########################### Your code ends above ###########################

# The highest value of these prediction probabilities for a given image corresponds 
# to the class that the model is most confident about.                                    
# So let's get the index of the maximum value from that test image's prediction probabilities
testImageProbs        = testProbs[testImage]              # Prediction probabilities for our testImage
testImageMaxProbIndex = testImageProbs.argmax()           # Index of the max value from that array of 10 probabilities
print('\nIndex of predicted class:', testImageMaxProbIndex) # This should print a 4


# The 5th array element is the one with index 4 since Python uses 0-based array indexing.  

predictedClass = LABEL_NAMES[testImageMaxProbIndex]
print('\nPredicted class is: ' + predictedClass)      # This should print 'deer'  

# You should see output where the 5th element of the array has the highest value, about 0.55.
# Since the 5th element of LABEL_NAMES is 'deer', you should be looking at a picture
# of a deer after you run this cell.  You should also see in the probability values
# from the print statement that there is only one other class that has
# a non-zero probability and that class is 'frog'.  So the classifier
# is a little bit more confident about the picture being a deer than it is a frog.

print('Low resolution picture (hopefully) of a', predictedClass)

# Let's view that picture to see if our model got the correct prediction

# Use matplotlib's pyplot to show the picture of testImage.
# Go up and look how you imported pyplot so you reference it correctly here.

# Also, remember that testImage is just an index into the matrix X_test_flat
#    If you take a look at Assignment 1, you'll find that I gave 
#    you almost the exact code that you'll need to use here.
      
# You will need to use 4 pyplot operations: (see Assignment 1!!!)
#    a. 2 points: Use pyplot's 'figure' to set a figure size of 2 by 2
#    b. 4 points: Use pyplot's 'imshow' to display the test image.
#       Make sure that you reference X_test_flat to get this image!
#       If you reference X_test I will not give you the 4 points.
#       But you need to recall from Assignment 1 that when you flattened
#       the images I told you in a comment:
#       'Note: If you want to show the images again after flattening them,'
#       'you'll have to reshape them back to their original (32, 32, 3)   '
#    c. 2 points: Turn the pyplot axis 'off'
#    d. 2 points: Call pyplot's 'show' to finally print the image into the notebook.

#################### Insert your code below for 10 points   ###############



########################### Your code ends above ###########################

# If you used random_state=42 in your SGD classifier above
# AND if the model's prediction is correct, then
# that picture should be a deer.

# If you try the above code with 'testImage = 4700' instead of 7916, 
# you will see a different picture, and the model's prediction should be wrong,
# so you'll see a different predictedClass.  Try it.
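
To see the shape of what predict_proba returns without touching the CIFAR-10 data, here is a small sketch on made-up toy data. Everything named `toy_*` is an assumption for illustration only. As the sklearn documentation describes it, with modified_huber the estimates are derived from clipped decision scores and then normalized, which is one reason several entries in a row can be exactly zero.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import SGDClassifier

# Toy data (assumed for illustration): 300 points in 3 clusters.
X_toy, y_toy = make_blobs(n_samples=300, centers=3, n_features=2, random_state=0)
toy_names = ['red', 'green', 'blue']          # hypothetical label names

toy_clf = SGDClassifier(loss='modified_huber', random_state=0)
toy_clf.fit(X_toy, y_toy)

# One row per sample, one column per class.
toy_probs = toy_clf.predict_proba(X_toy)
print(toy_probs.shape)                        # (300, 3)

# The column with the highest value is the class the model favors.
sample = 0
best = toy_probs[sample].argmax()
print(toy_probs[sample], '->', toy_names[best])
```

Each row sums to 1, so the values in a row can be compared directly with one another.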

In [None]:
# Task 5: 10 points

# This cell assumes: 
# from yellowbrick.classifier import ClassBalance

# We will use several tools from the yellowbrick package.  
# In addition to the URL at the beginning of this notebook
# you can read about yellowbrick here:
# https://www.scikit-yb.org/en/latest/

# And here is where you can see the details of ClassBalance for this cell:
# https://www.scikit-yb.org/en/latest/api/target/class_balance.html 

# The tool ClassBalance shows how the data is distributed.
# e.g. Is 90% of the data in only one category or evenly distributed?
# Your analysis may differ depending on the answer to such questions.
# For example, in Chapter 2 see Géron's discussion of stratified sampling.
# If you remember the description of the CIFAR-10 dataset the chart
# created should not be a surprise to you.

# So let's look at class balance for the training data.

# Create a class balance chart.  
# 1. 6 points: Show the class balance of the training data. Show the names 
#    (e.g. cat, ship, etc.) as bar labels. Give the chart a custom size
#    of 640 by 480 pixels by using the keyword argument: size=(640, 480)
#    You can just use the default bar colors, and show 
#    either the default chart title OR, if you wish, you can give it
#    your own custom title using the keyword 'title'.
#    Save your ClassBalance description into the variable cifar10Balance
# 2. 2 points: Call the fit method on cifar10Balance, passing it y_train
# 3. 2 points: Call the show method on cifar10Balance

########################### Your code starts here: 10 points ###########################

# Show the first chart below this line, approx. 3 lines of code:
cifar10Balance = 




################################## Your code ends here ################################

# In the plots you will see 'support' as the y-axis label.  
# The term 'support' in this chart refers to the number of data instances.
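
If you would like to check the 'support' numbers without a chart, the same counts can be computed directly with NumPy. This is only a sketch on a tiny made-up label array (`toy_names` and `toy_labels` are hypothetical, not the assignment's data):

```python
import numpy as np

toy_names = ['cat', 'dog', 'ship']                     # hypothetical label names
toy_labels = np.array([0, 2, 1, 1, 0, 2, 2, 1, 0])     # hypothetical integer labels

# Count how many instances belong to each class (the 'support').
classes, counts = np.unique(toy_labels, return_counts=True)
for c, n in zip(classes, counts):
    print(f'{toy_names[c]:>5}: {n}')
```

In this toy array every class appears 3 times, so it is perfectly balanced.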

In [None]:
# Task 6: 10 points

# CONFUSION MATRIX

# You can view a basic confusion matrix by passing the ground truth and 
# the corresponding predictions of your model as the two arguments 
# to sklearn's confusion_matrix function.

# 1. 4 points: To do this, first call the predict method of your 
#    sgdModHub passing X_test_flat as the method's argument.
#    Convert the results to an np.array and save that value into
#    the variable modelPredictions.
# 2. 4 points: Call confusion_matrix, passing it the ground truth  
#    labels y_test and those modelPredictions as two arguments, and  
#    save the confusion matrix into the variable confMat
# 3. 2 points: Display the value of confMat as the output of the cell.

#################### Insert your code below for 10 points   ###############

modelPredictions = 
confMat          = 


########################### Your code ends above ###########################

# Along the main diagonal we can see that the count for the correct class is OFTEN 
# smaller than one or more of the other counts in its row, which is really not a 
# surprise, given that sgdModHub is not a very good predictor of the images.

# Also notice that, while sklearn's confusion_matrix is easy to use, its raw output 
# is unlabeled, so it does not tell you whether the ground truth values are displayed
# on the rows or on the columns.  

# Fortunately, there are much better ways to view confusion matrices.
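
One way to convince yourself of the orientation is a tiny hand-checkable example. In sklearn's confusion_matrix, entry [i, j] counts the samples whose true class is i and whose predicted class is j, so rows are ground truth and columns are predictions. The toy lists below are made up for this sketch:

```python
from sklearn.metrics import confusion_matrix

y_true_toy = [0, 0, 1]   # ground truth
y_pred_toy = [0, 1, 1]   # predictions: the second sample is a class 0 mistaken for class 1

cm_toy = confusion_matrix(y_true_toy, y_pred_toy)
print(cm_toy)
# [[1 1]
#  [0 1]]
```

The single mistake shows up in row 0, column 1 (truth 0, predicted 1), while row 1, column 0 stays at zero.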

In [None]:
# Yellowbrick's Confusion Matrix (with Heat Map!)

# The yellowbrick package makes visualizing model statistics 
# much easier (and better!) than using standard sklearn tools. 
# You can learn more about yellowbrick at: https://www.scikit-yb.org/en/latest/index.html 
# Yellowbrick uses Matplotlib "under the covers" to create the plots.

cmTrainedModel = ConfusionMatrix(sgdModHub, 
                                 classes=LABEL_NAMES, 
                                 size   =(640, 480))   
                                 # The size of the plot in pixels.
                                 # Omit it for the smaller default size

# To create the ConfusionMatrix, we need to make some predictions on the test data.
# The score method runs predict() on the data, calculates the accuracy score,
# and then creates the confusion_matrix from scikit-learn.
print('\nAccuracy of cmTrainedModel model on test data:',
      cmTrainedModel.score(X_test_flat, y_test), '\n')

cmTrainedModel.show() # Call the 'show' method to display that confusion matrix

# Check it out now! Notice how much more informative this is.
# The heat map shows darker colored cells for higher numbers 
# of pictures assigned to the categories of the labels in the columns.  
# And now you can see the rows clearly labeled as ground truth, 
# while the columns are clearly labeled as the predictions.

# At a glance, you can see from the colors that sgdModHub is a bad classifier.
# It seems like almost everything looks like a deer to sgdModHub!!
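
A quick numeric way to spot a class that the model predicts far too often is to count the predictions per class. The sketch below uses a short made-up prediction array (`toy_names` and `toy_preds` are hypothetical), not the output of sgdModHub:

```python
import numpy as np

toy_names = ['cat', 'deer', 'ship']                  # hypothetical label names
toy_preds = np.array([1, 1, 0, 1, 1, 2, 1, 1])       # hypothetical model predictions

# Count how often each class index was predicted.
pred_counts = np.bincount(toy_preds, minlength=len(toy_names))
for name, n in zip(toy_names, pred_counts):
    print(f'{name:>5}: predicted {n} times')

most_predicted = toy_names[pred_counts.argmax()]
print('Most frequently predicted class:', most_predicted)
```

These counts correspond to the column sums of a confusion matrix, so one very large count matches one very dark column in the heat map.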

In [None]:
# Task 7: 15 points

# Now we begin with a model that is not yet trained.

# Using the previous cell as a guide, add your code to do the following:
# 1. 5 points: Create a new SGDClassifier that again uses modified_huber
#    but now set the alpha argument to 0.2, which will greatly
#    improve the accuracy.  Set n_jobs if you want to speed up training
#    and use a random state of 42.
#    Save this untrained classifier into variable sgdUntrained
# 2. 5 points: call ConfusionMatrix, passing your 
#    untrained model sgdUntrained as the first argument.
#    Other arguments should be the label names for classes,
#    a size of 720 by 540 pixels for your confusion matrix,
#    and a custom title (just set keyword 'title' to a string 
#    that you think is appropriate for this task.)
#    Save the confusion matrix object into variable cmBetter
# 3. 3 points: Call the fit method on cmBetter to train your model
#    on the training data.
# 4. 1 point: print an appropriate message along with the accuracy
#    score on the test data
# 5. 1 point: call the show method on cmBetter


#################### Insert your code below for 15 points   ###############

sgdUntrained = 


cmBetter    = 


########################### Your code ends above ###########################

print('\n Class counts of the test data:', cmBetter.class_counts_) 
# If you try this with the pretrained model in the previous cell you'd get an error,                                                                    
# which is what the warning message of the previous cell was all about.

# If you did this correctly, then 
# you can see quickly that the darker colors are along the main diagonal, 
# which is what we want for a good classifier -- the darker the better.
# You can see that the predictions for 'bird' are not as good as the other 
# categories since the 'bird' cell on the main diagonal is lighter 
# in color than the others on that diagonal, and the best predictions 
# are for 'airplane', 'ship' and 'truck' because they are the darkest.
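
The "darker diagonal" reading can also be expressed as numbers. Assuming rows are ground truth and columns are predictions, dividing each diagonal entry by its row sum gives the fraction of each class that was predicted correctly. The matrix below is invented for this sketch and is not the result of any model in this notebook:

```python
import numpy as np

toy_names = ['airplane', 'bird', 'ship']   # hypothetical label names
toy_cm = np.array([[8, 1, 1],              # rows: ground truth
                   [3, 4, 3],              # columns: predictions
                   [1, 0, 9]])

# Fraction of each true class that was predicted correctly (per-class recall).
per_class_recall = toy_cm.diagonal() / toy_cm.sum(axis=1)
# Fraction of all samples that were predicted correctly.
overall_accuracy = toy_cm.trace() / toy_cm.sum()

for name, r in zip(toy_names, per_class_recall):
    print(f'{name:>8}: {r:.2f}')
print(f'overall accuracy: {overall_accuracy:.2f}')
```

Here 'bird' has the lowest value on the diagonal, which is what a lighter diagonal cell in the heat map would indicate.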

### Extra Credit: 25 points total divided between two tasks of 15 and 10 points, respectively.

If you wish to do these tasks, you will combine what we learned above with the grid search and 3-fold cross validation that we did in Assignment 2. 

In [None]:
# Optional Task 8: 15 extra credit points

# For this you will create a totally different model using an algorithm
# called LogisticRegression.  Even though it's called regression,
# it can be used as a classifier.  You have already imported it
# in the first cell above.  The first thing to do is call
# LogisticRegression with 2 arguments: n_jobs=-1 and a random
# state of 42.  You can use the default values for everything else.
# Save the value of that model in the variable logReg

# LogisticRegression runs a LOT more slowly than SGDClassifier 
# for multi-class classification (you'll understand why when we study it 
# soon in class), so we will only do grid search on ONE hyperparameter 
# and we will only test 3 values. Just as SGDClassifier uses L2
# regularization, so does LogisticRegression, but in LogisticRegression
# the hyperparameter that controls the regularization is called 'C'
# (it is the inverse of the regularization strength, so a smaller C
# means stronger regularization).
# Thus, we will create a grid called 'C_grid' using 'C' as the key
# and a list of values to try for C during cross validation.  Since
# LogisticRegression is so slow, we will only test 3 values:
# [0.5, 1.0, 1.5]  Use that list as the value for key C in C_grid.
# The value 1.0 is the default value, so we will see if 0.5 or 1.5
# might work a little better when compared to 1.0

# You can expect this to take, as I said, a LOT longer than SGDClassifier
# even with only 3 values being tried out for cross-validation.  To save
# some more time, set the cross validation to do only 3-fold.  If you
# don't set it, the default value is to do 5-fold cross validation and 
# you will be waiting even longer! On my computer it took almost 20 minutes 
# for just 3 folds, so be prepared to wait for it!

# Capture your start and end times, print out the elapsed time and use 
# playsound to alert yourself when training is done.

# Also, print out your grid's best_estimator_
# After you see these results, continue on to the next cell.

# NOTE: You will probably see numerous warning messages in the output,
# but they are not errors and you can ignore them. HOWEVER, if you DO
# see an actual error message then you did something wrong that you
# will have to fix!

#################### Insert your code below for 15 points ##################

C_grid      = 
                                                     
logReg      = 
gridSearch  = 

startTime   = 


stopTime    = time.perf_counter()                             # Capture the ending time


########################### Your code ends above ###########################
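
For reference, the general shape of a grid search over a single hyperparameter looks roughly like the sketch below. It runs on a small synthetic dataset with different C values from the ones in the task, and the names `toy_grid` and `toy_search`, the `max_iter=1000` setting and the dataset itself are all assumptions made for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Small synthetic 3-class dataset (assumed for illustration only).
X_toy, y_toy = make_classification(n_samples=300, n_features=10,
                                   n_informative=5, n_classes=3,
                                   random_state=0)

toy_grid = {'C': [0.1, 1.0, 10.0]}            # one hyperparameter, three candidate values
toy_search = GridSearchCV(LogisticRegression(max_iter=1000, random_state=0),
                          toy_grid,
                          cv=3)               # 3-fold cross validation
toy_search.fit(X_toy, y_toy)

print(toy_search.best_params_)
print(toy_search.best_estimator_)
```

With 3 candidate values and 3 folds, the search fits 9 models plus one final refit on all of the data, which is why the run time grows quickly on a large dataset.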

In [None]:
# Optional Task 9: 10 Extra credit points

# Now that you know which combination of values gives you the best estimator,
# REDEFINE the logReg variable using those best values from the printout.  
# Now, set it up in the same way as you did above to create 
# a confusion matrix, and call its fit method on the training data. 
# Then, evaluate the trained model on the test data,  
# show the confusion matrix, and print the accuracy score to see 
# how it compares with the previous models you have trained. 

#################### Insert your code below for 10 points ##################

logReg    = 


########################### Your code ends above ###########################

# This may show a slightly better accuracy score than what you have seen previously.
# If you compare this confusion matrix with the previous one, you'll see that
# there are a lot more correct predictions for 'bird', hence the improvement
# in overall accuracy.  Others do better too, but ship and truck were slightly
# worse when I trained this model.