Copyright 2020 Andrew M. Olney and made available under [CC BY-SA](https://creativecommons.org/licenses/by-sa/4.0) for text and [Apache-2.0](http://www.apache.org/licenses/LICENSE-2.0) for code.


# KNN classification: Problem solving

In this session, you'll work through a complete example using a new dataset, `binary`.

## Load the dataframe

The `binary.csv` dataset contains 4 variables:

| Variable    | Type    | Description           |
|:-------------|:---------|:-----------------------|
| admit | Nominal   | admission status (0=not admitted, 1=admitted) |
| gre  | Ratio   | the student's GRE score  |
| gpa | Ratio   | the student's GPA |
| rank  | Ordinal   | rank of the institution (1=highest to 4=lowest prestige)  |


Start by importing `pandas`.

In [13]:
import pandas as pd

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="/%IFN5#t2uAm}E`8:KV:">pd</variable></variables><block type="importAs" id="_@5-r*j4E`}d?=DXLinf" x="16" y="10"><field name="libraryName">pandas</field><field name="VAR" id="/%IFN5#t2uAm}E`8:KV:">pd</field></block></xml>

Load a dataframe with `binary.csv` and display the dataframe.

In [14]:
dataframe = pd.read_csv('datasets/binary.csv')

dataframe

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="(*,(U(rm+VG0+vg;w$65">dataframe</variable><variable id="/%IFN5#t2uAm}E`8:KV:">pd</variable></variables><block type="variables_set" id="gM*jw`FfIR3)8=g0iEB7" x="-12" y="162"><field name="VAR" id="(*,(U(rm+VG0+vg;w$65">dataframe</field><value name="VALUE"><block type="varDoMethod" id="ny0sjvqTnn2B]K2za7Li"><mutation items="1"></mutation><field name="VAR" id="/%IFN5#t2uAm}E`8:KV:">pd</field><field name="MEMBER">read_csv</field><data>pd:read_csv</data><value name="ADD0"><block type="text" id="dfrpI5b@DHr+DQ:|@vpv"><field name="TEXT">datasets/binary.csv</field></block></value></block></value></block><block type="variables_get" id="tKXDNJFh}!c~`YX;)~{u" x="8" y="330"><field name="VAR" id="(*,(U(rm+VG0+vg;w$65">dataframe</field></block></xml>

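|     | admit | gre | gpa  | rank |
|:----|:------|:----|:-----|:-----|
| 0   | 0     | 380 | 3.61 | 3    |
| 1   | 1     | 660 | 3.67 | 3    |
| 2   | 1     | 800 | 4.00 | 1    |
| 3   | 1     | 640 | 3.19 | 4    |
| 4   | 0     | 520 | 2.93 | 4    |
| ... | ...   | ... | ...  | ...  |
| 395 | 0     | 620 | 4.00 | 2    |
| 396 | 0     | 560 | 3.04 | 3    |
| 397 | 0     | 460 | 2.63 | 2    |
| 398 | 0     | 700 | 3.65 | 2    |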


## Prepare the train/test data

To train the classifiers, you need to split the dataframe into training data and testing data.

Start by creating a dataframe `Y` that just has `admit` in it, and then display `Y` so you can be sure it worked.

In [15]:
Y = dataframe[['admit']]

Y

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="@83Gxqb{/h|%%9Yr?2q!">Y</variable><variable id="(*,(U(rm+VG0+vg;w$65">dataframe</variable></variables><block type="variables_set" id="46~F|Y~BiXd|Vok0}p,p" x="33" y="69"><field name="VAR" id="@83Gxqb{/h|%%9Yr?2q!">Y</field><value name="VALUE"><block type="indexer" id="p,##xUo8Nj3`vpMVBjr8"><field name="VAR" id="(*,(U(rm+VG0+vg;w$65">dataframe</field><value name="INDEX"><block type="lists_create_with" id="a.wA7Xdx%GbQ)t#XDKvV"><mutation items="1"></mutation><value name="ADD0"><block type="text" id="aZG|GLqavO~,RXBOcuJ+"><field name="TEXT">admit</field></block></value></block></value></block></value></block><block type="variables_get" id="W]bu*LAe{2;n#?+7WQ-[" x="30" y="219"><field name="VAR" id="@83Gxqb{/h|%%9Yr?2q!">Y</field></block></xml>

|     | admit |
|:----|:------|
| 0   | 0     |
| 1   | 1     |
| 2   | 1     |
| 3   | 1     |
| 4   | 0     |
| ... | ...   |
| 395 | 0     |
| 396 | 0     |
| 397 | 0     |
| 398 | 0     |


Next, do the same for `X`, using the other columns in the dataframe (`gre`, `gpa`, and `rank`).

In [16]:
X = dataframe[['gre', 'gpa', 'rank']]

X

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="py,.kr!f9i=.v7I5_d+^">X</variable><variable id="(*,(U(rm+VG0+vg;w$65">dataframe</variable></variables><block type="variables_set" id="46~F|Y~BiXd|Vok0}p,p" x="33" y="69"><field name="VAR" id="py,.kr!f9i=.v7I5_d+^">X</field><value name="VALUE"><block type="indexer" id="p,##xUo8Nj3`vpMVBjr8"><field name="VAR" id="(*,(U(rm+VG0+vg;w$65">dataframe</field><value name="INDEX"><block type="lists_create_with" id="a.wA7Xdx%GbQ)t#XDKvV"><mutation items="3"></mutation><value name="ADD0"><block type="text" id="aZG|GLqavO~,RXBOcuJ+"><field name="TEXT">gre</field></block></value><value name="ADD1"><block type="text" id="b/mW9(0w,rS.ny;Y(]5A"><field name="TEXT">gpa</field></block></value><value name="ADD2"><block type="text" id="4yc?!4pnBI1/rSBQ41!t"><field name="TEXT">rank</field></block></value></block></value></block></value></block><block type="variables_get" id="W]bu*LAe{2;n#?+7WQ-[" x="30" y="219"><field name="VAR" id="py,.kr!f9i=.v7I5_d+^">X</field></block></xml>

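|     | gre | gpa  | rank |
|:----|:----|:-----|:-----|
| 0   | 380 | 3.61 | 3    |
| 1   | 660 | 3.67 | 3    |
| 2   | 800 | 4.00 | 1    |
| 3   | 640 | 3.19 | 4    |
| 4   | 520 | 2.93 | 4    |
| ... | ... | ...  | ...  |
| 395 | 620 | 4.00 | 2    |
| 396 | 560 | 3.04 | 3    |
| 397 | 460 | 2.63 | 2    |
| 398 | 700 | 3.65 | 2    |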


To split the data into training and testing data, import `model_selection`.

In [17]:
import sklearn.model_selection as model_selection

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="uASGz64Zb$AOvQyV4pRj">model_selection</variable></variables><block type="importAs" id="sN1YO5FEzpHyxb31@j,Z" x="16" y="10"><field name="libraryName">sklearn.model_selection</field><field name="VAR" id="uASGz64Zb$AOvQyV4pRj">model_selection</field></block></xml>

Now split the data into training and testing data, setting `test_size` to one of the following three values, depending on your birthday:

If your birthday is in:

- Jan, Feb, Mar, Apr, use `0.2` 
- May, Jun, Jul, Aug, use `0.4`
- Sep, Oct, Nov, Dec, use `0.6`

So, depending on your birthday, we'll use 20%, 40%, or 60% of the data for testing.

In [18]:
splits = model_selection.train_test_split(X,Y,test_size=0.2)

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="_ut$e0PL4OMi4o1MXTpw">splits</variable><variable id="uASGz64Zb$AOvQyV4pRj">model_selection</variable><variable id="py,.kr!f9i=.v7I5_d+^">X</variable><variable id="@83Gxqb{/h|%%9Yr?2q!">Y</variable></variables><block type="variables_set" id="oTGRJ#{R!U^we@Bl@pkT" x="30" y="136"><field name="VAR" id="_ut$e0PL4OMi4o1MXTpw">splits</field><value name="VALUE"><block type="varDoMethod" id="f?j@ker(a#hJv;Nh)IGX"><mutation items="3"></mutation><field name="VAR" id="uASGz64Zb$AOvQyV4pRj">model_selection</field><field name="MEMBER">train_test_split</field><data>model_selection:train_test_split</data><value name="ADD0"><block type="variables_get" id=".mm}`*H4)i%Eq5z={e-$"><field name="VAR" id="py,.kr!f9i=.v7I5_d+^">X</field></block></value><value name="ADD1"><block type="variables_get" id="I3dOV;CPBf^~E%BvgthZ"><field name="VAR" id="@83Gxqb{/h|%%9Yr?2q!">Y</field></block></value><value name="ADD2"><block type="dummyOutputCodeBlock" id="@Hg?ib/!8fH$;f3pWJy2"><field name="CODE">test_size=0.2</field></block></value></block></value></block></xml>
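
The list returned by `train_test_split` holds four pieces in a fixed order: training features, testing features, training labels, and testing labels. Here is a minimal sketch of that ordering, using a made-up 10-row stand-in for `binary.csv`. The rows, the `toy` names, and `random_state=0` are assumptions for illustration and are not part of the exercise.

```python
import pandas as pd
import sklearn.model_selection as model_selection

# Hypothetical stand-in for binary.csv: 10 illustrative rows, same column names.
toy = pd.DataFrame({
    "admit": [0, 1, 1, 1, 0, 0, 1, 0, 0, 1],
    "gre":   [380, 660, 800, 640, 520, 760, 560, 400, 540, 700],
    "gpa":   [3.61, 3.67, 4.00, 3.19, 2.93, 3.00, 2.98, 3.08, 3.39, 3.92],
    "rank":  [3, 3, 1, 4, 4, 2, 1, 2, 3, 2],
})
X_toy = toy[["gre", "gpa", "rank"]]
Y_toy = toy[["admit"]]

# random_state is an added assumption so the shuffle is repeatable.
toy_splits = model_selection.train_test_split(
    X_toy, Y_toy, test_size=0.2, random_state=0
)

# The four pieces always come back in this order.
X_train, X_test, Y_train, Y_test = toy_splits

print(len(X_train), len(X_test))  # 8 2
```

This ordering is why the cells below train on `splits[0]` and `splits[2]` and test on `splits[1]` and `splits[3]`.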

## KNN

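First import `numpy` and `neighbors`. You'll need `numpy` to flatten the training labels with `ravel` when you train the model.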

In [19]:
import numpy as np
import sklearn.neighbors as neighbors

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="Zhzp)s*VL?V@ES3(j:*b">np</variable><variable id="~%[Y}a{Syr+LrQ[I8?d(">neighbors</variable></variables><block type="importAs" id="f8xvj%e=mr_kR7(?kiiO" x="118" y="284"><field name="libraryName">numpy</field><field name="VAR" id="Zhzp)s*VL?V@ES3(j:*b">np</field><next><block type="importAs" id="ayw$B{(evWs,ynm/fP%g"><field name="libraryName">sklearn.neighbors</field><field name="VAR" id="~%[Y}a{Syr+LrQ[I8?d(">neighbors</field></block></next></block></xml>

Next, define the KNN model, e.g. using `create`, with `n_neighbors` set to 5.

In [20]:
knn = neighbors.KNeighborsClassifier(n_neighbors=5)

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="qiYg#10P*[{cvl~,g!4n">knn</variable><variable id="~%[Y}a{Syr+LrQ[I8?d(">neighbors</variable></variables><block type="variables_set" id="`z.|tb|jX$Y2a5;iN{pz" x="9" y="278"><field name="VAR" id="qiYg#10P*[{cvl~,g!4n">knn</field><value name="VALUE"><block type="varCreateObject" id="9i7]OHHOnhf2~MjH^n+y"><mutation items="1"></mutation><field name="VAR" id="~%[Y}a{Syr+LrQ[I8?d(">neighbors</field><field name="MEMBER">KNeighborsClassifier</field><data>neighbors:KNeighborsClassifier</data><value name="ADD0"><block type="dummyOutputCodeBlock" id="R*)t!8POzxw3}@cy,{:]"><field name="CODE">n_neighbors=5</field></block></value></block></value></block></xml>
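
To make `n_neighbors=5` concrete, here is a rough sketch of the idea behind a KNN prediction: find the 5 training points closest to a new point, then take a majority vote of their labels. The one-feature data and all variable names are invented for illustration, and scikit-learn's actual implementation is more sophisticated than this.

```python
import numpy as np

# Made-up training data: one feature per point, two classes.
train_points = np.array([[1.0], [2.0], [3.0], [6.0], [7.0], [8.0], [9.0]])
train_labels = np.array([0, 0, 0, 1, 1, 1, 1])

query = np.array([4.2])  # the new point to classify
k = 5

# Euclidean distance from the query to every training point.
distances = np.linalg.norm(train_points - query, axis=1)

# Indices of the k closest training points.
nearest = np.argsort(distances)[:k]

# Majority vote among the labels of those neighbors.
votes = np.bincount(train_labels[nearest])
predicted_label = int(votes.argmax())

print(votes)            # [3 2]
print(predicted_label)  # 0
```

Three of the five nearest neighbors have label `0`, so the vote goes to `0` even though the two closest points disagree with each other.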

Now train KNN and make the predictions in one cell.
Save the predictions in the variable `predictions`, and then show the predictions to make sure it worked.

In [21]:
knn.fit(splits[0],np.ravel(splits[2]))

predictions = knn.predict(splits[1])

predictions

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="qiYg#10P*[{cvl~,g!4n">knn</variable><variable id="S${cx6mpro:A3F6)k7GU">predictions</variable><variable id="Zhzp)s*VL?V@ES3(j:*b">np</variable><variable id="_ut$e0PL4OMi4o1MXTpw">splits</variable></variables><block type="varDoMethod" id="8Ljs~haYrP^|UkSn*ZFY" x="14" y="249"><mutation items="2"></mutation><field name="VAR" id="qiYg#10P*[{cvl~,g!4n">knn</field><field name="MEMBER">fit</field><data>knn:fit</data><value name="ADD0"><block type="lists_getIndex" id="1MeJkC]YhSKb1g:QUq!i"><mutation statement="false" at="true"></mutation><field name="MODE">GET</field><field name="WHERE">FROM_START</field><value name="VALUE"><block type="variables_get" id="^R/0R7yd6JHXc%c++34J"><field name="VAR" id="_ut$e0PL4OMi4o1MXTpw">splits</field></block></value><value name="AT"><block type="math_number" id="jLV:Y45[^:fBT{|!(.r`"><field name="NUM">1</field></block></value></block></value><value name="ADD1"><block type="varDoMethod" id=";tG1YnMBwV?P4hBmIFcK"><mutation items="1"></mutation><field name="VAR" id="Zhzp)s*VL?V@ES3(j:*b">np</field><field name="MEMBER">ravel</field><data>np:ravel</data><value name="ADD0"><block type="lists_getIndex" id="k/?QqU$23,hhQq.^1(BU"><mutation statement="false" at="true"></mutation><field name="MODE">GET</field><field name="WHERE">FROM_START</field><value name="VALUE"><block type="variables_get" id="*;}}|a4^CJemb:I]=)A5"><field name="VAR" id="_ut$e0PL4OMi4o1MXTpw">splits</field></block></value><value name="AT"><block type="math_number" id="pB6/r)Wezpc@A.wR5z-a"><field name="NUM">3</field></block></value></block></value></block></value></block><block type="variables_set" id="4G|k@2K0-I7/kpl}nfJ3" x="13" y="402"><field name="VAR" id="S${cx6mpro:A3F6)k7GU">predictions</field><value name="VALUE"><block type="varDoMethod" id="GDo@ju2Q3G[a0FgMg8kA"><mutation items="1"></mutation><field name="VAR" id="qiYg#10P*[{cvl~,g!4n">knn</field><field 
name="MEMBER">predict</field><data>knn:predict</data><value name="ADD0"><block type="lists_getIndex" id="fDJ#sl}z|qvH1g;IEi6|"><mutation statement="false" at="true"></mutation><field name="MODE">GET</field><field name="WHERE">FROM_START</field><value name="VALUE"><block type="variables_get" id="eAgJ9|+-=R-~_UL0WE{G"><field name="VAR" id="_ut$e0PL4OMi4o1MXTpw">splits</field></block></value><value name="AT"><block type="math_number" id="t0mtofEKF~@c0WrN/K{N"><field name="NUM">2</field></block></value></block></value></block></value></block><block type="variables_get" id="wRW`Td.q2HYCQP]__kwz" x="11" y="500"><field name="VAR" id="S${cx6mpro:A3F6)k7GU">predictions</field></block></xml>

array([0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0,
       0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1,
       1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0])

You should see a mix of `1` and `0` in the predictions. 

**QUESTION:**

Do you think `0` or `1` is more common in this dataset?
What could you do with the dataframe to check?

**ANSWER: (click here to edit)**

*`0` looks more common. An easy way to check would be to use `describe` on the dataframe: the mean of `admit` is the proportion of `1`s, so a mean below 0.5 means `0` is more common.*
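
Another way to check is `value_counts`, which counts each label directly. It is sketched here on a made-up `admit` column rather than the real data, so the `toy` dataframe is an assumption for illustration.

```python
import pandas as pd

# Made-up labels: 7 zeros and 3 ones.
toy = pd.DataFrame({"admit": [0, 1, 1, 1, 0, 0, 0, 0, 0, 0]})

# Count how many times each label occurs.
counts = toy["admit"].value_counts()
print(counts[0], counts[1])  # 7 3

# Because the labels are 0/1, the mean is the share of 1s.
share_admitted = toy["admit"].mean()
print(share_admitted)  # 0.3
```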

<hr>

## Classifier evaluation

To see if the model is any good, do some evaluations.

First import `metrics`.

In [22]:
import sklearn.metrics as metrics

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="C8HQ^p):qDkQA/nhxe{{">metrics</variable></variables><block type="importAs" id="|1fEPj_#9:@0Qa@y|1F." x="135" y="207"><field name="libraryName">sklearn.metrics</field><field name="VAR" id="C8HQ^p):qDkQA/nhxe{{">metrics</field></block></xml>

### Accuracy


Now calculate the KNN accuracy.

In [23]:
metrics.accuracy_score(splits[3],predictions)

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="C8HQ^p):qDkQA/nhxe{{">metrics</variable><variable id="S${cx6mpro:A3F6)k7GU">predictions</variable><variable id="_ut$e0PL4OMi4o1MXTpw">splits</variable></variables><block type="varDoMethod" id="l,28Y;cf6Q.r-rTBWP4}" x="76" y="293"><mutation items="2"></mutation><field name="VAR" id="C8HQ^p):qDkQA/nhxe{{">metrics</field><field name="MEMBER">accuracy_score</field><data>metrics:accuracy_score</data><value name="ADD0"><block type="lists_getIndex" id="6Nb!QUCs+zcSbgi2=8Jb"><mutation statement="false" at="true"></mutation><field name="MODE">GET</field><field name="WHERE">FROM_START</field><value name="VALUE"><block type="variables_get" id="0SesC-O%M@-rqWgnWJ[q"><field name="VAR" id="_ut$e0PL4OMi4o1MXTpw">splits</field></block></value><value name="AT"><block type="math_number" id="ZQBTZmCAMzn[;:irh:{@"><field name="NUM">4</field></block></value></block></value><value name="ADD1"><block type="variables_get" id=":5r(/A;QY|,.0/gzcHi."><field name="VAR" id="S${cx6mpro:A3F6)k7GU">predictions</field></block></value></block></xml>

0.5625

**QUESTION:**

What does the accuracy tell you about the errors the model is making?

**ANSWER: (click here to edit)**

*Accuracy doesn't tell you about a specific type of error; you'd need to use a confusion matrix or another metric, such as per-class precision and recall, to get that information.*

<hr>
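
As a rough illustration of what a confusion matrix adds, the sketch below uses `metrics.confusion_matrix` on made-up labels and predictions (not the notebook's `splits` or `predictions`). It separates the two kinds of error that accuracy lumps together.

```python
import sklearn.metrics as metrics

# Made-up true labels and predictions, for illustration only.
true_labels = [0, 0, 0, 0, 1, 1, 1, 0]
predicted_labels = [0, 0, 1, 0, 1, 0, 0, 0]

# Rows are the true class, columns are the predicted class.
matrix = metrics.confusion_matrix(true_labels, predicted_labels)
print(matrix)

# For two classes, ravel() yields the four cells in this order.
true_neg, false_pos, false_neg, true_pos = matrix.ravel()
print(false_pos, false_neg)  # 1 2

# Accuracy only reports the total share of correct predictions.
print(metrics.accuracy_score(true_labels, predicted_labels))  # 0.625
```

Here the accuracy of 0.625 hides the fact that most of the mistakes are missed `1`s (false negatives) rather than false alarms.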

**QUESTION:**

What do you expect to happen to accuracy as you increase `test_size`?

**ANSWER: (click here to edit)**

*Accuracy should generally go down as `test_size` increases, because putting more data into the test set leaves less data for training, and KNN can only compare new points against the training examples it has been given.*

<hr>
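
One way to test this expectation is to repeat the split at each `test_size` and average the accuracy. The sketch below does that on synthetic data, not `binary.csv`. The data-generating rule, the 20 repeats, and all variable names are assumptions for illustration. Any difference between the averages can be small and noisy, which is why a single split is not enough to judge.

```python
import numpy as np
import sklearn.metrics as metrics
import sklearn.model_selection as model_selection
import sklearn.neighbors as neighbors

# Synthetic two-class data: 400 rows, 3 features (an assumption, not binary.csv).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 3))
noise = rng.normal(scale=1.0, size=400)
y_demo = (X_demo[:, 0] + X_demo[:, 1] + noise > 0).astype(int)

accuracy_by_test_size = {}
for test_size in [0.2, 0.4, 0.6]:
    scores = []
    # Repeat with different shuffles so one lucky split doesn't mislead.
    for seed in range(20):
        X_tr, X_te, y_tr, y_te = model_selection.train_test_split(
            X_demo, y_demo, test_size=test_size, random_state=seed
        )
        knn_demo = neighbors.KNeighborsClassifier(n_neighbors=5)
        knn_demo.fit(X_tr, y_tr)
        scores.append(metrics.accuracy_score(y_te, knn_demo.predict(X_te)))
    accuracy_by_test_size[test_size] = float(np.mean(scores))

for test_size, mean_accuracy in accuracy_by_test_size.items():
    print(test_size, round(mean_accuracy, 3))
```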

### Recall/Precision per class


Get the KNN `classification_report`.

In [24]:
print(metrics.classification_report(splits[3],predictions))

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="C8HQ^p):qDkQA/nhxe{{">metrics</variable><variable id="S${cx6mpro:A3F6)k7GU">predictions</variable><variable id="_ut$e0PL4OMi4o1MXTpw">splits</variable></variables><block type="text_print" id="Sx0uF}9IfzkucZiyR^1:" x="9" y="224"><value name="TEXT"><block type="varDoMethod" id="l,28Y;cf6Q.r-rTBWP4}"><mutation items="2"></mutation><field name="VAR" id="C8HQ^p):qDkQA/nhxe{{">metrics</field><field name="MEMBER">classification_report</field><data>metrics:classification_report</data><value name="ADD0"><block type="lists_getIndex" id="6Nb!QUCs+zcSbgi2=8Jb"><mutation statement="false" at="true"></mutation><field name="MODE">GET</field><field name="WHERE">FROM_START</field><value name="VALUE"><block type="variables_get" id="0SesC-O%M@-rqWgnWJ[q"><field name="VAR" id="_ut$e0PL4OMi4o1MXTpw">splits</field></block></value><value name="AT"><block type="math_number" id="ZQBTZmCAMzn[;:irh:{@"><field name="NUM">4</field></block></value></block></value><value name="ADD1"><block type="variables_get" id=":5r(/A;QY|,.0/gzcHi."><field name="VAR" id="S${cx6mpro:A3F6)k7GU">predictions</field></block></value></block></value></block></xml>

              precision    recall  f1-score   support

           0       0.66      0.70      0.68        53
           1       0.33      0.30      0.31        27

    accuracy                           0.56        80
   macro avg       0.50      0.50      0.50        80
weighted avg       0.55      0.56      0.56        80
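
The numbers in this report come from simple counts. The four counts below are not printed anywhere in the notebook. They are inferred from the report, as the whole numbers consistent with its supports (53 and 27), recalls, and accuracy, so treat them as an assumption. A minimal sketch of the arithmetic, with made-up variable names:

```python
# Inferred counts (assumption), treating 1 = admitted as the positive class.
true_neg = 37   # actual 0, predicted 0
false_pos = 16  # actual 0, predicted 1
false_neg = 19  # actual 1, predicted 0
true_pos = 8    # actual 1, predicted 1

# Precision: of everything predicted as a class, how much was right?
precision_0 = true_neg / (true_neg + false_neg)
precision_1 = true_pos / (true_pos + false_pos)

# Recall: of everything actually in a class, how much was found?
recall_0 = true_neg / (true_neg + false_pos)
recall_1 = true_pos / (true_pos + false_neg)

# Accuracy: all correct predictions over all predictions.
total = true_neg + false_pos + false_neg + true_pos
accuracy = (true_neg + true_pos) / total

print(round(precision_0, 2), round(recall_0, 2))  # 0.66 0.7
print(round(precision_1, 2), round(recall_1, 2))  # 0.33 0.3
print(accuracy)                                   # 0.5625
```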



**QUESTION:**

Is there a particular class that KNN does worse on?
Why do you think that might be?

**ANSWER: (click here to edit)**

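*KNN does worse on class `1`: its precision and recall are about 0.3, compared with about 0.7 for class `0`. Roughly two-thirds of the `admit` values are `0` (53 of the 80 test rows here), so the classifier has a harder time with the `1`s because they are less common. This is called imbalanced classes and is a problem often seen in the real world.*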
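
One way to see why imbalance matters is to compare against a baseline that always predicts the majority class. The sketch below rebuilds only the class counts from the report's `support` column (53 zeros and 27 ones). The variable names are made up for illustration.

```python
import numpy as np
import sklearn.metrics as metrics

# Class counts taken from the support column of the report.
true_labels = np.array([0] * 53 + [1] * 27)

# A "model" that ignores its input and always predicts 0.
always_zero = np.zeros(len(true_labels), dtype=int)

baseline_accuracy = metrics.accuracy_score(true_labels, always_zero)
baseline_recall_1 = metrics.recall_score(true_labels, always_zero, pos_label=1)

print(baseline_accuracy)  # 0.6625
print(baseline_recall_1)  # 0.0
```

Always guessing `0` scores 0.6625, which beats the KNN accuracy of 0.5625 while never identifying a single admitted student. This is why accuracy alone can be misleading when classes are imbalanced.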


<hr>
