# First Exam

This is an open book, open note exam. You may consult internet sources for Python/coding questions, but your code should be your own work with your own comments. You should not work with other students in the class (or anyone else!). As usual, please cite any sources that you used.

There are four topic questions and one reflection question. For each topic question, you should have a number of code cells. The code cells should be carefully commented to explain what you are doing. You should conclude each question with a markdown cell containing a summary of what you learned in that question. These should generally be written in paragraph form, with nice formatting, and refer to specific results from your code cells. The majority of the grade will be determined from how well you summarize and discuss your work and demonstrate an understanding of the algorithms being used. (Of course, you must have correctly functioning code to discuss!)

### Office Hours
Joanna or Tamara will be available in the office hour channel on Monday, February 8, from 1-5. For questions at other times, please tag us in the office hour channel and we'll respond as soon as we can.

### Submission Instructions
Submit your exam in Moodle by 9:30 am on Tuesday, February 9. Submit **both** a PDF of your notebook and a link (that we can access/edit) to your Colab notebook.

## Question 1 

Load the data set at url="http://facweb1.redlands.edu/fac/Tamara_Veenstra/ML/fruits.csv" as a pandas data frame. Clean up the data as necessary, then explore the data using various pandas and matplotlib techniques. You should have both graphs and statistical analysis. You will be applying kNN classification and PCA dimensionality reduction to this data set, so think about what information would be helpful to you in that process. In particular, you will pick two of the features to use for classification with kNN, so you may want to explore which features seem useful.

The concluding markdown cell should contain a summary about what you learned about the data set in your data exploration and any clean up work that was necessary.  

In [1]:
## IMPORTS FROM HW DAY 4 JUPYTER NOTEBOOK ##

# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

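# Scikit-Learn ≥0.20 is required (compare numeric parts; a plain string comparison is lexicographic)
import sklearn
assert tuple(int(p) for p in sklearn.__version__.split(".")[:2]) >= (0, 20)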

# Common imports
import numpy as np
import pandas as pd

# Start a list of useful packages
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from pandas.plotting import scatter_matrix

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Ignore useless warnings (see SciPy issue #5998)
import warnings
warnings.filterwarnings(action="ignore", message="^internal gelsd")

In [10]:
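# Import the data (header=None reads the CSV header line as an ordinary data row)
df = pd.read_csv(
    filepath_or_buffer='http://facweb1.redlands.edu/fac/Tamara_Veenstra/ML/fruits.csv',
    header=None,
    sep=',')
# Assign descriptive column names
df.columns = ['num', 'fruit_name', 'fruit_subtype', 'height', 'mass', 'width', 'color_score']
# Drop row 0 (the original header line) and the redundant row-number column
df.drop(0, inplace=True)
df.drop('num', axis=1, inplace=True)
df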

    fruit_name  fruit_subtype  height  mass  width  color_score
1        apple   granny_smith     7.3   192    8.4         0.55
2        apple   granny_smith     6.8   180    8.0         0.59
3        apple   granny_smith     7.2   176    7.4          0.6
4     mandarin       mandarin     4.7    86    6.2          0.8
5     mandarin       mandarin     4.6    84    6.0         0.79
6     mandarin       mandarin     4.3    80    5.8         0.77
7     mandarin       mandarin     4.3    80    5.9         0.81
8     mandarin       mandarin     4.0    76    5.8         0.81
9        apple       braeburn     7.8   178    7.1         0.92
10       apple       braeburn     7.0   172    7.4         0.89


In [16]:
#Check for NaNs
df.isna().sum()

fruit_name       0
fruit_subtype    0
height           0
mass             0
width            0
color_score      0
dtype: int64

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 59 entries, 1 to 59
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   fruit_name     59 non-null     object
 1   fruit_subtype  59 non-null     object
 2   height         59 non-null     object
 3   mass           59 non-null     object
 4   width          59 non-null     object
 5   color_score    59 non-null     object
dtypes: object(6)
memory usage: 3.2+ KB


Note that every column in this dataframe has dtype object. The last four columns (height, mass, width, color_score) hold numbers, so we should convert them to numeric dtypes.
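
One plausible reason every column came out as object is that `header=None` makes pandas treat the CSV's header line as a data row, so each column mixes a text label with numbers. The following is a minimal sketch of that effect, assuming the file's first line is a header with an empty first field. The inline `csv_text` is a hypothetical three-row excerpt built from rows shown above, not the real file.

```python
import io
import pandas as pd

# Hypothetical excerpt standing in for fruits.csv (values copied from the table above)
csv_text = """,fruit_name,fruit_subtype,height,mass,width,color_score
1,apple,granny_smith,7.3,192,8.4,0.55
2,apple,granny_smith,6.8,180,8.0,0.59
4,mandarin,mandarin,4.7,86,6.2,0.8
"""

# header=None: the header line becomes row 0, so the measurement columns hold strings
as_data = pd.read_csv(io.StringIO(csv_text), header=None)

# header=0: the first line supplies column names; index_col=0 uses the first column as the index
parsed = pd.read_csv(io.StringIO(csv_text), header=0, index_col=0)

print(as_data.dtypes)
print(parsed.dtypes)
```

Reading the file this way would give numeric dtypes directly, although the explicit `pd.to_numeric` conversion used in this notebook reaches the same result.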

In [12]:
# Convert the four measurement columns from object to numeric dtypes;
# errors='coerce' turns any unparseable entry into NaN
df["height"]=pd.to_numeric(df["height"],errors='coerce',downcast='float')
df["mass"]=pd.to_numeric(df["mass"],errors='coerce',downcast='integer')
df["width"]=pd.to_numeric(df["width"],errors='coerce',downcast='float')
df["color_score"]=pd.to_numeric(df["color_score"],errors='coerce',downcast='float')

In [14]:
# Making sure it worked properly
df.dtypes

fruit_name        object
fruit_subtype     object
height           float32
mass               int16
width            float32
color_score      float32
dtype: object
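
As a sketch of the kind of statistical exploration Question 1 asks for, per-class summaries can hint at which features separate the fruit types. The example below uses only the ten rows displayed above, not the full 59-row data set, and the `separation` measure (gap between class means divided by the overall standard deviation) is an illustrative heuristic rather than a standard statistic.

```python
import pandas as pd

# The ten rows displayed earlier in this notebook (not the full data set)
rows = [
    ("apple", "granny_smith", 7.3, 192, 8.4, 0.55),
    ("apple", "granny_smith", 6.8, 180, 8.0, 0.59),
    ("apple", "granny_smith", 7.2, 176, 7.4, 0.60),
    ("mandarin", "mandarin", 4.7, 86, 6.2, 0.80),
    ("mandarin", "mandarin", 4.6, 84, 6.0, 0.79),
    ("mandarin", "mandarin", 4.3, 80, 5.8, 0.77),
    ("mandarin", "mandarin", 4.3, 80, 5.9, 0.81),
    ("mandarin", "mandarin", 4.0, 76, 5.8, 0.81),
    ("apple", "braeburn", 7.8, 178, 7.1, 0.92),
    ("apple", "braeburn", 7.0, 172, 7.4, 0.89),
]
numeric_cols = ["height", "mass", "width", "color_score"]
sample = pd.DataFrame(rows, columns=["fruit_name", "fruit_subtype"] + numeric_cols)

# How many examples of each fruit are there?
counts = sample["fruit_name"].value_counts()

# Mean of each numeric feature per fruit
class_means = sample.groupby("fruit_name")[numeric_cols].mean()

# Rough separability: gap between class means in units of the overall standard deviation
gap = (class_means.loc["apple"] - class_means.loc["mandarin"]).abs()
separation = gap / sample[numeric_cols].std()

print(counts)
print(class_means)
print(separation.sort_values(ascending=False))
```

Features with a large gap relative to their spread are natural candidates for the two kNN features. Scatter plots of candidate feature pairs, colored by fruit name, would complement these numbers.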

## Question 2

Prepare the data for classification based on fruit name. Convert the categorical fruit name column to a numerical column, then split the data into training and testing sets in TWO ways: first using a random split with `random_state=0`, and then using a stratified split. (Specifying the random state ensures that the same split is produced every time you run it, and when we run it to grade it, so it's important for consistency here!) Pick two of the features for classification and scale your features using `StandardScaler`.

The concluding markdown cell should contain a summary of the different steps you used here and why they are important. Also describe the difference between the two different training/testing split methods. How important is scaling to the features you chose? How important is scaling for kNN versus PCA? Explain.
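
The steps described in this question can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference solution: the data are synthetic stand-ins (three hypothetical classes with made-up mass and width values), and `LabelEncoder` is one of several ways to encode the target.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Synthetic stand-in for the fruit data (NOT the real measurements)
rng = np.random.default_rng(0)
names = np.array(["apple"] * 20 + ["mandarin"] * 8 + ["orange"] * 12)
centers = {"apple": (170.0, 7.5), "mandarin": (80.0, 5.9), "orange": (190.0, 7.6)}
X = np.array([rng.normal(centers[n], (15.0, 0.4)) for n in names])  # columns: mass, width

# Convert the categorical names to integer codes (assigned in alphabetical order)
encoder = LabelEncoder()
y = encoder.fit_transform(names)

# Split 1: purely random, so class proportions in the test set can drift
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Split 2: stratified, so each class keeps its overall proportion in both sets
Xs_tr, Xs_te, ys_tr, ys_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

print("random test counts:    ", np.bincount(y_te, minlength=3))
print("stratified test counts:", np.bincount(ys_te, minlength=3))

# Fit the scaler on the training data only, then reuse it for the test data
scaler = StandardScaler().fit(Xs_tr)
Xs_tr_scaled = scaler.transform(Xs_tr)
Xs_te_scaled = scaler.transform(Xs_te)
```

Fitting the scaler on the training set alone keeps information about the test set from leaking into preprocessing.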

## Question 3

Apply kNN to classify the training and testing data for a variety of values of k, using your choice of two features. Compute the scores on both the training and testing sets, and illustrate your results with multiple decision boundary graphs for several different values of k, including both training and testing data (but differentiating between them). Examine the scores on both the training and testing set for a wide variety of values of k, and include a graph of these scores as a function of k. Which value of k works the best here? You should do this twice, once for each method of splitting the data into training and testing sets.

The concluding markdown cell should contain your analysis of how well kNN works for different values of k. What values of k make sense here and why? What happens as k increases? How do the decision boundary graphs change as k increases, and why does that make sense? When is kNN underfitting and when is it overfitting? You should also compare and contrast results for the different splits into training and testing sets. Does one method work better than the other? Explain why or why not based on the training and testing data.
In general, how well does kNN work on this data set and what limitations or difficulties are there to applying kNN on this data set?
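
A sweep over k of the kind described above can be sketched as follows. The data are again synthetic (three hypothetical classes with two made-up features), so the printed scores say nothing about the real fruit data. The grid at the end shows how a decision boundary plot could be prepared.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature, three-class data (NOT the real fruit measurements)
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], [20, 8, 12])
centers = np.array([[170.0, 7.5], [80.0, 5.9], [190.0, 7.6]])
X = centers[y] + rng.normal(0.0, (15.0, 0.4), size=(y.size, 2))

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

# Record training and testing accuracy for a range of odd k
k_values = list(range(1, 16, 2))
train_scores, test_scores = [], []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr_s, y_tr)
    train_scores.append(knn.score(X_tr_s, y_tr))
    test_scores.append(knn.score(X_te_s, y_te))

for k, tr, te in zip(k_values, train_scores, test_scores):
    print(f"k={k:2d}  train={tr:.2f}  test={te:.2f}")

# Decision boundary grid for one k: predict a class at every grid point
knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr_s, y_tr)
xx, yy = np.meshgrid(np.linspace(-3, 3, 100), np.linspace(-3, 3, 100))
Z = knn.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
# Z can be drawn with plt.contourf(xx, yy, Z) and overlaid with scatter
# plots of the training and testing points in different marker styles
```

With k=1 each training point is its own nearest neighbor, so the training score is perfect. That is a sign of overfitting rather than of a good model, which is why the testing score is the one to watch.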


## Question 4

Apply PCA to convert all four features in the fruit data set to two features. Make sure to apply mean normalization first here! Include a graph of the reduced data set. Is it possible to interpret the new features in terms of the previous features? Apply kNN classification to this data set and compare to your previous results. Include appropriate decision boundary graphs.

The concluding markdown cell should describe what the purpose of PCA is and compare/contrast kNN both with and without PCA. What are the pros and cons of each? Is the same value of k best in both cases?
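
The PCA step can be sketched as below, assuming four synthetic features named after the fruit columns (the values are made up). `StandardScaler` here performs the mean normalization and also rescales each feature to unit variance. Scikit-learn's `PCA` centers the data internally as well, but it does not rescale.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic four-feature data (NOT the real fruit measurements)
rng = np.random.default_rng(0)
n = 60
mass = rng.normal(150.0, 40.0, n)
width = 5.0 + 0.015 * mass + rng.normal(0.0, 0.3, n)
height = 4.0 + 0.02 * mass + rng.normal(0.0, 0.5, n)
color_score = rng.uniform(0.5, 0.95, n)
X = np.column_stack([height, mass, width, color_score])

# Center each feature at 0 and scale it to unit variance
X_scaled = StandardScaler().fit_transform(X)

# Project the four features onto the two directions of greatest variance
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print("explained variance ratio:", pca.explained_variance_ratio_)
print("loadings (rows = components; columns = height, mass, width, color_score):")
print(pca.components_)
```

Each row of `pca.components_` gives the weights of the original features in one new feature, which is the starting point for interpreting the components. `X_2d` can then be split and passed to kNN in the same way as the two hand-picked features.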

## Question 5    

How is class going for you so far? What's working well in the online world? What would help you learn more? What are you most excited about so far? It is not necessary to include any code here, but do include a markdown cell with a paragraph reflecting on the first three weeks of class.