Copyright 2022 Andrew M. Olney and made available under [CC BY-SA](https://creativecommons.org/licenses/by-sa/4.0) for text and [Apache-2.0](http://www.apache.org/licenses/LICENSE-2.0) for code.


# KNN classification: Problem solving

In this session, you'll work through a complete example using a new dataset, `binary`.

## Load the dataframe

The `binary.csv` dataset contains 4 variables:

| Variable    | Type    | Description           |
|:-------------|:---------|:-----------------------|
| admit | Nominal   | the admittance status (0=not admitted, 1=admitted) |
| gre  | Ratio   | the student's GRE score  |
| gpa | Ratio   | the student's GPA |
| rank  | Ordinal   | rank of the institution (1=highest to 4=lowest prestige)  |


Start by loading `readr` and `dplyr`.

In [1]:
library(readr)
library(dplyr)

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="_;PP-/]_2fNUR.dyhw(8">readr</variable><variable id="`IEAx*Bh}E,Y}mK;jr;{">dplyr</variable></variables><block type="import_R" id="q]np1Ju|B`4k*R-zylwU" x="44" y="66"><field name="libraryName" id="_;PP-/]_2fNUR.dyhw(8">readr</field><next><block type="import_R" id="(%@0XC,((6M%4]kj+iDm"><field name="libraryName" id="`IEAx*Bh}E,Y}mK;jr;{">dplyr</field></block></next></block></xml>


Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union




Load a dataframe with `binary.csv` and display the dataframe.

*Note: It's useful to list your factor levels with the target level first, i.e. "1". Our performance metrics will assume that the first level is the target.*

In [16]:
dataframe = readr::read_csv("datasets/binary.csv",col_types= list(admit = col_factor(c("1", "0"))))

dataframe

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="(*,(U(rm+VG0+vg;w$65">dataframe</variable><variable id="_;PP-/]_2fNUR.dyhw(8">readr</variable></variables><block type="variables_set" id="gM*jw`FfIR3)8=g0iEB7" x="28" y="220"><field name="VAR" id="(*,(U(rm+VG0+vg;w$65">dataframe</field><value name="VALUE"><block type="varDoMethod_R" id=",vaW{t?FHN1~E?+,h!w-"><mutation items="2"></mutation><field name="VAR" id="_;PP-/]_2fNUR.dyhw(8">readr</field><field name="MEMBER">read_csv</field><data>readr:read_csv</data><value name="ADD0"><block type="text" id="dfrpI5b@DHr+DQ:|@vpv"><field name="TEXT">datasets/binary.csv</field></block></value><value name="ADD1"><block type="valueOutputCodeBlock_R" id="z6B+)So*d0^OjD*`zeoy"><field name="CODE">col_types=</field><value name="INPUT"><block type="lists_create_with" id="W|E4(pWg{_lEMJw9C$kC"><mutation items="1"></mutation><value name="ADD0"><block type="dummyOutputCodeBlock_R" id="!}{|N3Q;s2%t{*`C5Ky/"><field name="CODE">admit = col_factor(c("1", "0"))</field></block></value></block></value></block></value></block></value></block><block type="variables_get" id="dn{+Q#DO%lN;G_tFGJ#B" x="8" y="304"><field name="VAR" id="(*,(U(rm+VG0+vg;w$65">dataframe</field></block></xml>

admit,gre,gpa,rank
<fct>,<dbl>,<dbl>,<dbl>
0,380,3.61,3
1,660,3.67,3
1,800,4.00,1
1,640,3.19,4
0,520,2.93,4
⋮,⋮,⋮,⋮
0,620,4.00,2
0,560,3.04,3
0,460,2.63,2
0,700,3.65,2
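
The `col_factor(c("1", "0"))` call above fixes the level order so that `1` comes first. As a minimal sketch of what that changes, the base R snippet below builds the same kind of factor by hand from the first five `admit` values printed above. The variable names `admit_raw`, `default_order`, and `target_first` are illustrative only. `yardstick` binary metrics treat the first level as the event by default (`event_level = "first"`), which is why the order matters later on.

```r
# First five admit values from the printed dataframe above
admit_raw <- c("0", "1", "1", "1", "0")

# Default behaviour: levels are sorted, so "0" ends up first
default_order <- factor(admit_raw)

# Explicit order: the target level "1" is listed first,
# mirroring col_factor(c("1", "0"))
target_first <- factor(admit_raw, levels = c("1", "0"))

levels(default_order)
levels(target_first)
```

With the default order, recall and precision would be reported for `0` rather than for `1`.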


## Prepare the train/test data

To train the classifiers, you need to split the dataframe into training data and testing data.

Start by importing `rsample`.

In [17]:
library(rsample)

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="~~-I(f=60)#JfKGvV_AP">rsample</variable></variables><block type="import_R" id="g^doBJYp/fk!)^uuYnUf" x="-280" y="10"><field name="libraryName" id="~~-I(f=60)#JfKGvV_AP">rsample</field></block></xml>

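Now split the data into training and testing data. Choose your test size from one of the following 3 values depending on your birthday:

If your birthday is in:

- Jan, Feb, Mar, Apr, use `0.2`
- May, Jun, Jul, Aug, use `0.4`
- Sep, Oct, Nov, Dec, use `0.6`

So depending on your birthday, you'll use 20, 40, or 60% of the data for testing. Note that `initial_split` takes `prop`, the proportion of the data used for *training*, so set `prop` to one minus your test size (`0.8`, `0.6`, or `0.4`). The code below uses a test size of `0.2`, i.e. `prop=.80`.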

In [18]:
data_split = rsample::initial_split(dataframe,prop=.80)
data_train = rsample::training(data_split)
data_test = rsample::testing(data_split)

data_train

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="bQ!4E:J!~]0(]7KV]m@=">data_split</variable><variable id=":iMr},W7(N7vSLAUw!ao">data_train</variable><variable id="~~-I(f=60)#JfKGvV_AP">rsample</variable><variable id="(*,(U(rm+VG0+vg;w$65">dataframe</variable><variable id="|q$XCeTWL%AdgT|]tbnU">data_test</variable></variables><block type="variables_set" id="s!g),aa^(]dox/f`@P!y" x="-116" y="313"><field name="VAR" id="bQ!4E:J!~]0(]7KV]m@=">data_split</field><value name="VALUE"><block type="varDoMethod_R" id="hPsr6}9C/VNgaLsKuR,o"><mutation items="2"></mutation><field name="VAR" id="~~-I(f=60)#JfKGvV_AP">rsample</field><field name="MEMBER">initial_split</field><data>rsample:initial_split</data><value name="ADD0"><block type="variables_get" id="]~#@ltf];dTom_%pzV4n"><field name="VAR" id="(*,(U(rm+VG0+vg;w$65">dataframe</field></block></value><value name="ADD1"><block type="dummyOutputCodeBlock_R" id="R`?vH79hsA6Duxa9)AFX"><field name="CODE">prop=.80</field></block></value></block></value><next><block type="variables_set" id="3J6#JDFV0wE?V;NuM=?L"><field name="VAR" id=":iMr},W7(N7vSLAUw!ao">data_train</field><value name="VALUE"><block type="varDoMethod_R" id="X|Q7lj,pD_9W{%^.xd7h"><mutation items="1"></mutation><field name="VAR" id="~~-I(f=60)#JfKGvV_AP">rsample</field><field name="MEMBER">training</field><data>rsample:training</data><value name="ADD0"><block type="variables_get" id="JFCmHyJPiN`qwnlE~:iT"><field name="VAR" id="bQ!4E:J!~]0(]7KV]m@=">data_split</field></block></value></block></value><next><block type="variables_set" id="Y]ag(g~}tkN6:_X*]6P{"><field name="VAR" id="|q$XCeTWL%AdgT|]tbnU">data_test</field><value name="VALUE"><block type="varDoMethod_R" id="WBYo8G|ZcojJAqETRnv`"><mutation items="1"></mutation><field name="VAR" id="~~-I(f=60)#JfKGvV_AP">rsample</field><field name="MEMBER">testing</field><data>rsample:testing</data><value name="ADD0"><block type="variables_get" id="p^~x9|Zj((6qaUVvj#.E"><field name="VAR" 
id="bQ!4E:J!~]0(]7KV]m@=">data_split</field></block></value></block></value></block></next></block></next></block><block type="variables_get" id="9j){6[r67+7OFx`a~K[Y" x="-115" y="515"><field name="VAR" id=":iMr},W7(N7vSLAUw!ao">data_train</field></block></xml>

admit,gre,gpa,rank
<fct>,<dbl>,<dbl>,<dbl>
0,660,3.31,4
0,640,3.12,3
1,780,3.80,3
0,340,3.15,3
0,500,3.31,3
⋮,⋮,⋮,⋮
1,660,3.88,2
1,800,3.70,1
0,520,3.10,4
0,800,3.15,4
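
The split above draws rows at random, so the rows you see will differ from run to run. The sketch below shows one way to make a split repeatable and to convert a test size into `prop`. It assumes a made-up dataframe `toy` and an arbitrary seed. Neither is part of the original notebook.

```r
library(rsample)

# Hypothetical stand-in for the real dataframe (100 rows, 30% admitted)
toy <- data.frame(
    id = 1:100,
    admit = factor(rep(c("1", "0"), times = c(30, 70)), levels = c("1", "0"))
)

# Arbitrary seed so the random split can be reproduced
set.seed(42)

# A test size of 0.4 corresponds to a training proportion of 0.6
test_size <- 0.4
toy_split <- rsample::initial_split(toy, prop = 1 - test_size)
toy_train <- rsample::training(toy_split)
toy_test <- rsample::testing(toy_split)

# Row counts of the two pieces
c(train = nrow(toy_train), test = nrow(toy_test))
```

Every row lands in exactly one of the two pieces, so the two counts always add up to the size of the original dataframe.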


## KNN

First load `parsnip` and `generics`.

In [19]:
library(parsnip)
library(generics)

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="3q]Js%*Alzd]|p|FOe}-">parsnip</variable><variable id="w(9-o9gLSDEJ,]Qt}e!^">generics</variable></variables><block type="import_R" id="Tkh?^4ccrGs0mL!EM3hu" x="-101" y="-34"><field name="libraryName" id="3q]Js%*Alzd]|p|FOe}-">parsnip</field><next><block type="import_R" id=".Hs/97T-2cD7?pjtke5p"><field name="libraryName" id="w(9-o9gLSDEJ,]Qt}e!^">generics</field></block></next></block></xml>

Next define the KNN model and fit it.

In [20]:
model = parsnip::nearest_neighbor(neighbors = 5) %>%
    parsnip::set_mode("classification") %>%
    parsnip::set_engine("kknn") %>%
    parsnip::fit.model_spec(admit ~ .,data = data_train)

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="mgo;O)iX^5)A5.@gqIkA">model</variable><variable id="3q]Js%*Alzd]|p|FOe}-">parsnip</variable></variables><block type="variables_set" id="*(u89h_-M@(PB-!qP/1l" x="-94" y="176"><field name="VAR" id="mgo;O)iX^5)A5.@gqIkA">model</field><value name="VALUE"><block type="pipe_R" id="Oj1:/l+xksT^dSM;w{,g"><mutation items="3"></mutation><value name="INPUT"><block type="varDoMethod_R" id="m!?PwvZD%AJHhH1paJ[8"><mutation items="1"></mutation><field name="VAR" id="3q]Js%*Alzd]|p|FOe}-">parsnip</field><field name="MEMBER">nearest_neighbor</field><data>parsnip:nearest_neighbor</data><value name="ADD0"><block type="dummyOutputCodeBlock_R" id="m*FRi~byi7Ob_,Q%6]=$"><field name="CODE">neighbors = 5</field></block></value></block></value><value name="ADD0"><block type="varDoMethod_R" id="ady`8N}J*2BV_*AmtF`n"><mutation items="1"></mutation><field name="VAR" id="3q]Js%*Alzd]|p|FOe}-">parsnip</field><field name="MEMBER">set_mode</field><data>parsnip:set_mode</data><value name="ADD0"><block type="text" id="#4#x{=R9!%aA:_,cHnF("><field name="TEXT">classification</field></block></value></block></value><value name="ADD1"><block type="varDoMethod_R" id="bybPF(gahfhB3cmyy;/n"><mutation items="1"></mutation><field name="VAR" id="3q]Js%*Alzd]|p|FOe}-">parsnip</field><field name="MEMBER">set_engine</field><data>parsnip:set_engine</data><value name="ADD0"><block type="text" id="v9=Ry*6UUQjzFr5Sy@,f"><field name="TEXT">kknn</field></block></value></block></value><value name="ADD2"><block type="varDoMethod_R" id="Q61XWF.0ty]Aw^i83YUS"><mutation items="2"></mutation><field name="VAR" id="3q]Js%*Alzd]|p|FOe}-">parsnip</field><field name="MEMBER">fit.model_spec</field><data>parsnip:fit.model_spec</data><value name="ADD0"><block type="dummyOutputCodeBlock_R" id="n(du-Q]_KTwoCFj6n|V5"><field name="CODE">admit ~ .</field></block></value><value name="ADD1"><block type="dummyOutputCodeBlock_R" id="lIU{_ql0epb7NH-C6Kvw"><field 
name="CODE">data = data_train</field></block></value></block></value></block></value></block></xml>
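
To make `neighbors = 5` more concrete, here is a rough base R sketch of the idea behind KNN: find the 5 training rows closest to a new point and take a majority vote. It is a simplification under stated assumptions, not what the `kknn` engine does internally (that engine can weight neighbors by distance, so its probabilities will generally differ). The training values and `new_point` are made up for illustration.

```r
# Hypothetical training rows (not taken from binary.csv)
train <- data.frame(
    gre = c(380, 660, 800, 640, 520, 760, 560, 700),
    gpa = c(3.61, 3.67, 4.00, 3.19, 2.93, 3.00, 2.98, 3.92),
    admit = factor(c("0", "1", "1", "1", "0", "1", "0", "0"),
                   levels = c("1", "0"))
)

# Hypothetical new applicant to classify
new_point <- c(gre = 720, gpa = 3.8)

# Standardize so gre (hundreds) does not dominate gpa (0 to 4)
features <- c("gre", "gpa")
centers <- colMeans(train[features])
scales <- sapply(train[features], sd)
train_scaled <- scale(train[features], center = centers, scale = scales)
new_scaled <- (new_point[features] - centers) / scales

# Euclidean distance from the new point to every training row
distances <- sqrt(rowSums(sweep(train_scaled, 2, new_scaled)^2))

# Indices of the k closest rows, then an unweighted majority vote
k <- 5
nearest <- order(distances)[1:k]
votes <- table(train$admit[nearest])
predicted <- names(votes)[which.max(votes)]

votes
predicted
```

Using an odd `k` with two classes avoids tied votes.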

Now display model predictions on the test data to make sure it worked.

In [21]:
parsnip::predict.model_fit(model,data_test)

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="3q]Js%*Alzd]|p|FOe}-">parsnip</variable><variable id="mgo;O)iX^5)A5.@gqIkA">model</variable><variable id="|q$XCeTWL%AdgT|]tbnU">data_test</variable></variables><block type="varDoMethod_R" id="9|=:dCZ9{DKV$n07!1C!" x="-160" y="132"><mutation items="2"></mutation><field name="VAR" id="3q]Js%*Alzd]|p|FOe}-">parsnip</field><field name="MEMBER">predict.model_fit</field><data>parsnip:predict.model_fit</data><value name="ADD0"><block type="variables_get" id="QbNKT15/:0bkQ7|KddE-"><field name="VAR" id="mgo;O)iX^5)A5.@gqIkA">model</field></block></value><value name="ADD1"><block type="variables_get" id="=.XA:UP[M?98fAWf`?)A"><field name="VAR" id="|q$XCeTWL%AdgT|]tbnU">data_test</field></block></value></block></xml>

.pred_class
<fct>
0
0
1
1
0
⋮
0
0
0
0


You should see a mix of `1` and `0` in the predictions. 

**QUESTION:**

Do you think `0` or `1` is more common in this dataset?
What could you do with the dataframe to check?

**ANSWER: (click here to edit)**

*`0` looks more common. An easy way to check would be to use `tabyl` (from the `janitor` package) on the `admit` column of the dataframe.*
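
As an alternative to `tabyl`, the class counts can also be checked with `dplyr`, which is already loaded. The sketch below uses only the nine `admit` values visible in the printed dataframe, stored in a stand-in dataframe called `visible_rows` (an illustrative name). On the real data you would pass `dataframe` instead.

```r
library(dplyr)

# Stand-in data: the nine admit values visible in the printed dataframe
visible_rows <- data.frame(
    admit = factor(c("0", "1", "1", "1", "0", "0", "0", "0", "0"),
                   levels = c("1", "0"))
)

# Count rows per class and convert the counts to proportions
class_counts <- visible_rows %>%
    dplyr::count(admit) %>%
    dplyr::mutate(proportion = n / sum(n))

class_counts
```

The rows come back in factor-level order, so the target level `1` is listed first.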

## Classifier evaluation

To see if the model is any good, do some evaluations.

First `augment` the test data with the predictions.

In [22]:
data_evaluation = generics::augment(model,data_test)

data_evaluation

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id=".zA6Y6Y}9T^S|whSl]|6">data_evaluation</variable><variable id="w(9-o9gLSDEJ,]Qt}e!^">generics</variable><variable id="mgo;O)iX^5)A5.@gqIkA">model</variable><variable id="|q$XCeTWL%AdgT|]tbnU">data_test</variable></variables><block type="variables_set" id="qHR/^ulVJ-n)W4{:7-@Y" x="-193" y="204"><field name="VAR" id=".zA6Y6Y}9T^S|whSl]|6">data_evaluation</field><value name="VALUE"><block type="varDoMethod_R" id="k:ygkKz,swERp]34_uH{"><mutation items="2"></mutation><field name="VAR" id="w(9-o9gLSDEJ,]Qt}e!^">generics</field><field name="MEMBER">augment</field><data>generics:augment</data><value name="ADD0"><block type="variables_get" id="9A!7aBLNe1]IufZ-R;BY"><field name="VAR" id="mgo;O)iX^5)A5.@gqIkA">model</field></block></value><value name="ADD1"><block type="variables_get" id="qPW(v;lbSJ=RJ)y;hW$)"><field name="VAR" id="|q$XCeTWL%AdgT|]tbnU">data_test</field></block></value></block></value></block><block type="variables_get" id="Um|1+kaS%|p}e=,Pw)MQ" x="-203" y="317"><field name="VAR" id=".zA6Y6Y}9T^S|whSl]|6">data_evaluation</field></block></xml>

admit,gre,gpa,rank,.pred_class,.pred_1,.pred_0
<fct>,<dbl>,<dbl>,<dbl>,<fct>,<dbl>,<dbl>
0,600,2.82,4,0,0.3109971,0.6890029
0,680,3.19,4,0,0.3829806,0.6170194
1,760,3.35,2,1,0.6742699,0.3257301
1,780,3.22,2,1,0.8938879,0.1061121
0,520,3.29,1,0,0.4623578,0.5376422
⋮,⋮,⋮,⋮,⋮,⋮,⋮
0,540,2.70,2,0,0.1061121,0.8938879
0,700,3.65,2,0,0.3109971,0.6890029
0,420,3.02,1,0,0.3829806,0.6170194
0,580,3.36,2,0,0.2768684,0.7231316
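
The three new columns are related: `.pred_1` and `.pred_0` are the estimated class probabilities, and `.pred_class` is the class with the larger probability. As a quick check, the sketch below copies the first five rows of the output above into a small dataframe named `shown` (an illustrative name) and rebuilds `.pred_class` from the probabilities.

```r
# First five rows of the augmented output above (prediction columns only)
shown <- data.frame(
    .pred_class = c("0", "0", "1", "1", "0"),
    .pred_1 = c(0.3109971, 0.3829806, 0.6742699, 0.8938879, 0.4623578),
    .pred_0 = c(0.6890029, 0.6170194, 0.3257301, 0.1061121, 0.5376422)
)

# Pick whichever class has the larger probability
rebuilt <- ifelse(shown$.pred_1 > shown$.pred_0, "1", "0")

# TRUE if the rebuilt classes match the reported .pred_class column
all(rebuilt == shown$.pred_class)

# The two probabilities in each row sum to 1 (up to rounding)
rowSums(shown[c(".pred_1", ".pred_0")])
```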


Next load `yardstick`.

In [23]:
library(yardstick)

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="M8O}^6C_fm;DGZt9!{=e">yardstick</variable></variables><block type="import_R" id="~]m5/PaJhO^)r2YX)!Ko" x="-152" y="-34"><field name="libraryName" id="M8O}^6C_fm;DGZt9!{=e">yardstick</field></block></xml>

### Accuracy


Calculate the KNN accuracy.

In [24]:
yardstick::accuracy(data_evaluation,truth=admit,estimate=.pred_class)

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="M8O}^6C_fm;DGZt9!{=e">yardstick</variable><variable id=".zA6Y6Y}9T^S|whSl]|6">data_evaluation</variable></variables><block type="varDoMethod_R" id="ovJDL$T;GrTBZ,)jMz;a" x="-272" y="142"><mutation items="3"></mutation><field name="VAR" id="M8O}^6C_fm;DGZt9!{=e">yardstick</field><field name="MEMBER">accuracy</field><data>yardstick:accuracy</data><value name="ADD0"><block type="variables_get" id="$#GYCvI1LKXt%Rsb09a}"><field name="VAR" id=".zA6Y6Y}9T^S|whSl]|6">data_evaluation</field></block></value><value name="ADD1"><block type="dummyOutputCodeBlock_R" id="i#@XFD{vr]B47yD52|(B"><field name="CODE">truth=admit</field></block></value><value name="ADD2"><block type="dummyOutputCodeBlock_R" id="(x50a~#{oijRotT|Z?8G"><field name="CODE">estimate=.pred_class</field></block></value></block></xml>

.metric,.estimator,.estimate
<chr>,<chr>,<dbl>
accuracy,binary,0.675


**QUESTION:**

What does the accuracy tell you about the errors the model is making?

**ANSWER: (click here to edit)**

*Accuracy doesn't tell you about a specific type of error; you'd need to use a confusion matrix or other metric to get that information*
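
A confusion matrix is one way to see the error types that accuracy hides. The sketch below uses `yardstick::conf_mat` on a small made-up evaluation dataframe, `toy_eval`, so the counts are illustrative and do not come from the notebook's results. On the real data you would pass `data_evaluation` with the same `truth` and `estimate` arguments.

```r
library(yardstick)

# Hypothetical evaluation data: 3 true "1"s and 7 true "0"s
toy_eval <- data.frame(
    admit = factor(c("1", "1", "1", "0", "0", "0", "0", "0", "0", "0"),
                   levels = c("1", "0")),
    .pred_class = factor(c("1", "0", "0", "0", "0", "0", "0", "0", "1", "0"),
                         levels = c("1", "0"))
)

# Rows are predictions, columns are the true classes
cm <- yardstick::conf_mat(toy_eval, truth = admit, estimate = .pred_class)
cm

# Accuracy is the diagonal (correct predictions) over the total
accuracy_by_hand <- sum(diag(cm$table)) / sum(cm$table)
accuracy_by_hand
```

Here the accuracy looks reasonable even though two of the three true `1`s are missed, which is exactly the kind of detail a single accuracy number cannot show.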

**QUESTION:**

What do you expect to happen to accuracy as you increase the test size (i.e. decrease `prop`)?

**ANSWER: (click here to edit)**

*Accuracy should go down as test size increases, because putting more data into test means less data for training. KNN predicts directly from its stored training examples, so it benefits from having as many of them as possible.*

### Recall/Precision per class


Get recall for the target class (`1`, the first factor level).

In [25]:
yardstick::recall(data_evaluation,truth=admit,estimate=.pred_class)

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="M8O}^6C_fm;DGZt9!{=e">yardstick</variable><variable id=".zA6Y6Y}9T^S|whSl]|6">data_evaluation</variable></variables><block type="varDoMethod_R" id="ovJDL$T;GrTBZ,)jMz;a" x="-272" y="142"><mutation items="3"></mutation><field name="VAR" id="M8O}^6C_fm;DGZt9!{=e">yardstick</field><field name="MEMBER">recall</field><data>yardstick:recall</data><value name="ADD0"><block type="variables_get" id="$#GYCvI1LKXt%Rsb09a}"><field name="VAR" id=".zA6Y6Y}9T^S|whSl]|6">data_evaluation</field></block></value><value name="ADD1"><block type="dummyOutputCodeBlock_R" id="i#@XFD{vr]B47yD52|(B"><field name="CODE">truth=admit</field></block></value><value name="ADD2"><block type="dummyOutputCodeBlock_R" id="(x50a~#{oijRotT|Z?8G"><field name="CODE">estimate=.pred_class</field></block></value></block></xml>

.metric,.estimator,.estimate
<chr>,<chr>,<dbl>
recall,binary,0.32


Get precision for the target class (`1`, the first factor level).

In [27]:
yardstick::precision(data_evaluation,truth=admit,estimate=.pred_class)

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="M8O}^6C_fm;DGZt9!{=e">yardstick</variable><variable id=".zA6Y6Y}9T^S|whSl]|6">data_evaluation</variable></variables><block type="varDoMethod_R" id="ovJDL$T;GrTBZ,)jMz;a" x="-272" y="142"><mutation items="3"></mutation><field name="VAR" id="M8O}^6C_fm;DGZt9!{=e">yardstick</field><field name="MEMBER">precision</field><data>yardstick:precision</data><value name="ADD0"><block type="variables_get" id="$#GYCvI1LKXt%Rsb09a}"><field name="VAR" id=".zA6Y6Y}9T^S|whSl]|6">data_evaluation</field></block></value><value name="ADD1"><block type="dummyOutputCodeBlock_R" id="i#@XFD{vr]B47yD52|(B"><field name="CODE">truth=admit</field></block></value><value name="ADD2"><block type="dummyOutputCodeBlock_R" id="(x50a~#{oijRotT|Z?8G"><field name="CODE">estimate=.pred_class</field></block></value></block></xml>

.metric,.estimator,.estimate
<chr>,<chr>,<dbl>
precision,binary,0.4705882
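
Both metrics can be reproduced by hand from counts of true positives, false negatives, and false positives for the target class `1`. The counts below are an inference: they are one set of whole numbers that reproduces the recall and precision reported above, since the notebook does not print the actual confusion matrix. Treat them as an assumption for illustration.

```r
# Assumed counts for class "1" (inferred, not shown in the notebook output)
true_positives <- 8    # predicted 1, actually 1
false_negatives <- 17  # predicted 0, actually 1
false_positives <- 9   # predicted 1, actually 0

# Recall: of the actual 1s, what fraction did the model find?
recall <- true_positives / (true_positives + false_negatives)

# Precision: of the predicted 1s, what fraction were actually 1?
precision <- true_positives / (true_positives + false_positives)

round(recall, 7)
round(precision, 7)
```

Under these assumed counts, the model finds fewer than a third of the admitted students, and fewer than half of its `1` predictions are correct.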


**QUESTION:**

Is there a particular class that KNN does worse on?
Why do you think that might be?

**ANSWER: (click here to edit)**

*KNN does worse on the `1`s: recall for `1` is only 0.32 and precision is about 0.47. About 69% of the `admit`s are `0`, so the classifier has a harder time with the `1`s because they are less common. This is called imbalanced classes and is a problem often seen in the real world.*
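
One way to see why imbalance matters is to compare against a classifier that always predicts the majority class. The sketch below assumes a made-up set of 100 labels with 69 `0`s and 31 `1`s, matching the roughly 69% figure in the answer above. The real test split may have a somewhat different mix.

```r
# Hypothetical labels: 69 zeros and 31 ones
truth <- factor(rep(c("0", "1"), times = c(69, 31)), levels = c("1", "0"))

# A "classifier" that ignores its input and always predicts 0
always_zero <- factor(rep("0", length(truth)), levels = c("1", "0"))

# Accuracy of the always-0 baseline
baseline_accuracy <- mean(always_zero == truth)

# Recall for class 1: the baseline never predicts 1, so it finds none
baseline_recall <- sum(always_zero == "1" & truth == "1") / sum(truth == "1")

baseline_accuracy
baseline_recall
```

Under this assumption the baseline is right about 69% of the time, close to the 0.675 accuracy reported earlier, while never identifying a single `1`. That is why recall and precision for the rarer class are worth checking alongside accuracy.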
