# K-Nearest Neighbor (Marvel edition)

In this notebook, we'll:
<ul>
<li>Apply the <i>k</i>-NN algorithm</li>
<li>Use cross-validation</li>
<li>Apply scaling</li>
</ul>

We'll use the Marvel Wikia dataset from http://www.github.com/fivethirtyeight.

## <i>k</i>-Nearest Neighbor

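The <b><i>k</i>-Nearest Neighbor (<i>k</i>-NN)</b> algorithm is a supervised machine learning method commonly used for classification, that is, for predicting a categorical label. It assigns a new data point the label held by the majority of its <i>k</i> closest labeled neighbors.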

To review, machine learning is <b>supervised</b> when labeled data is available. The data scientist uses part of this labeled data to "train" a model, uses the trained model to predict labels for the remaining ("test") data, then compares those predictions with the true labels to measure accuracy.

<b>Categorical</b> data takes discrete, rather than continuous, values.
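To make the idea concrete before turning to the Marvel data, here is a minimal sketch of <i>k</i>-NN classification written with NumPy only. The function name `knn_predict`, the toy points, and the `red`/`blue` labels are all hypothetical and chosen for illustration; they are not part of the dataset or of any library.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated clusters with known labels
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
                    [6.0, 6.0], [6.5, 7.0], [7.0, 6.5]])
y_train = np.array(['red', 'red', 'red', 'blue', 'blue', 'blue'])

# Held-out "test" points; their true labels are used only to check accuracy
X_test = np.array([[1.2, 1.4], [6.8, 6.2]])
y_test = np.array(['red', 'blue'])

predictions = np.array([knn_predict(X_train, y_train, x, k=3) for x in X_test])
accuracy = (predictions == y_test).mean()
print(predictions)
print(accuracy)
```

Because the vote depends on raw distances, features measured on larger scales dominate the result, which is one reason scaling matters for <i>k</i>-NN.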

To start, let's import the packages we'll need.

In [2]:
%matplotlib inline
from __future__ import division
import pandas as pd
import numpy as np
import seaborn as sns  # Seaborn: a statistical plotting library built on top of matplotlib
import matplotlib.pyplot as plt

Next, let's read the dataset into a pandas DataFrame and explore it.

In [3]:
df = pd.read_csv('marvel-wikia-data.csv')

print(df.head())  # show the first five rows

   page_id                                 name  \
0     1678            Spider-Man (Peter Parker)   
1     7139      Captain America (Steven Rogers)   
2    64786  Wolverine (James \"Logan\" Howlett)   
3     1868    Iron Man (Anthony \"Tony\" Stark)   
4     2460                  Thor (Thor Odinson)   

                                   urlslug                ID  \
0              \/Spider-Man_(Peter_Parker)   Secret Identity   
1        \/Captain_America_(Steven_Rogers)   Public Identity   
2  \/Wolverine_(James_%22Logan%22_Howlett)   Public Identity   
3    \/Iron_Man_(Anthony_%22Tony%22_Stark)   Public Identity   
4                    \/Thor_(Thor_Odinson)  No Dual Identity   

                ALIGN         EYE        HAIR              SEX  GSM  \
0     Good Characters  Hazel Eyes  Brown Hair  Male Characters  NaN   
1     Good Characters   Blue Eyes  White Hair  Male Characters  NaN   
2  Neutral Characters   Blue Eyes  Black Hair  Male Characters  NaN   
3     Good Characters   

The dataset has the following features:
<ul>
<li><code>page_id: </code>the ID of the Wikia page for the character</li>
<li><code>name: </code>the name of the character</li>
<li><code>urlslug: </code>the unique URL string for the character's Wikia page</li>
<li><code>ID: </code>whether the character's identity is secret or public, or whether the character has no dual identity</li>
<li><code>ALIGN: </code>the character's moral alignment</li>
<li><code>EYE: </code>color of the character's eyes</li>
<li><code>HAIR: </code>color of the character's hair</li>
<li><code>SEX: </code>the character's sex or gender, which is not limited to male and female (for example, agender)</li>
<li><code>GSM: </code>the character's gender or sexual minority status, if any; not all characters have a value for this feature</li>
<li><code>ALIVE: </code>whether or not the character is living</li>
<li><code>APPEARANCES: </code>number of times the character has appeared in the Marvel multiverse</li>
<li><code>FIRST APPEARANCE: </code>month and year when the character first appeared in the Marvel multiverse</li>
<li><code>Year: </code>year of character's first appearance</li>
</ul>
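Since some features are missing for many characters, it helps to count missing and distinct values per column before modeling. The sketch below uses a small stand-in DataFrame named `sample` (a hypothetical name), built from the first three rows of the `head()` output above, so that it runs without the CSV file. The same two calls should work on `df` itself.

```python
import numpy as np
import pandas as pd

# Stand-in for df: three rows copied from the head() output above
sample = pd.DataFrame({
    'ALIGN': ['Good Characters', 'Good Characters', 'Neutral Characters'],
    'EYE': ['Hazel Eyes', 'Blue Eyes', 'Blue Eyes'],
    'HAIR': ['Brown Hair', 'White Hair', 'Black Hair'],
    'SEX': ['Male Characters'] * 3,
    'GSM': [np.nan, np.nan, np.nan],
})

# Number of missing values in each column
missing = sample.isnull().sum()
print(missing)

# Number of distinct non-missing values in each column
distinct = sample.nunique()
print(distinct)
```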

In [13]:
df.GSM.value_counts()

Homosexual Characters     66
Bisexual Characters       19
Transgender Characters     2
Genderfluid Characters     1
Transvestites              1
Pansexual Characters       1
dtype: int64
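The counts above sum to only 90 because `value_counts()` skips missing values by default. Passing `dropna=False` adds a row for `NaN`, which shows how sparse a feature is. The short sketch below demonstrates this on a hypothetical toy Series named `gsm` rather than on the real column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: two labeled values and three missing ones
gsm = pd.Series(['Bisexual Characters', np.nan,
                 'Homosexual Characters', np.nan, np.nan])

# Default behaviour: NaN entries are excluded from the counts
labeled = gsm.value_counts()
print(labeled.sum())

# dropna=False adds a count for the missing entries
with_missing = gsm.value_counts(dropna=False)
print(with_missing.sum())

# Direct count of missing entries
print(gsm.isnull().sum())
```

On the real data, the equivalent call would be `df.GSM.value_counts(dropna=False)`.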