# Supervised Learning: Classification of Zoo Animals using K-Nearest Neighbors

## Dataset Information

**Dataset Name:** 'Zoo Data Set'
- **Source:** http://archive.ics.uci.edu/ml/datasets/Zoo
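- **Description:** This dataset contains 101 animal records (the name 'frog' appears twice, so there are 100 distinct names). Besides its name, each animal has 17 attributes: 16 features and the 'type' class label.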
- Columns:

Column Name|Data Type|Description
--- | --- | ---
animal name | String | name of the animal
hair | Boolean | 0: hairless, 1: has hair
feathers | Boolean | 0: no feathers, 1: has feathers
eggs | Boolean | 0: no eggs, 1: lays eggs
milk | Boolean | 0: no milk, 1: milk
airborne | Boolean | 0: is not airborne, 1: is airborne
aquatic | Boolean | 0: is not aquatic, 1: is aquatic
predator | Boolean | 0: is not a predator, 1: is a predator
toothed | Boolean | 0: no teeth, 1: has teeth
backbone | Boolean | 0: no backbone, 1: has backbone
breathes | Boolean | 0: does not breathe, 1: breathes
venomous | Boolean | 0: is not venomous, 1: is venomous
fins | Boolean | 0: no fins, 1: has fins
legs | Numeric (set of values: {0,2,4,5,6,8}) | number of legs
tail | Boolean | 0: no tail, 1: has tail
domestic | Boolean | 0: not domestic, 1: is domestic
catsize | Boolean | 0: not catsize, 1: catsize
type | Numeric (integer values in range [1,7]) | category of animal it belongs to


- Distribution of the 'type' column:

'type' value | Number of Animals in the category | Animals in the category
--- | --- | ---
1 | (41) | aardvark, antelope, bear, boar, buffalo, calf, cavy, cheetah, deer, dolphin, elephant, fruitbat, giraffe, girl, goat, gorilla, hamster, hare, leopard, lion, lynx, mink, mole, mongoose, opossum, oryx, platypus, polecat, pony, porpoise, puma, pussycat, raccoon, reindeer, seal, sealion, squirrel, vampire, vole, wallaby, wolf
2 | (20) | chicken, crow, dove, duck, flamingo, gull, hawk, kiwi, lark, ostrich, parakeet, penguin, pheasant, rhea, skimmer, skua, sparrow, swan, vulture, wren
3 | (5) | pitviper, seasnake, slowworm, tortoise, tuatara
4 | (13) | bass, carp, catfish, chub, dogfish, haddock, herring, pike, piranha, seahorse, sole, stingray, tuna
5 | (4) | frog, frog, newt, toad
6 | (8) | flea, gnat, honeybee, housefly, ladybird, moth, termite, wasp
7 | (10) | clam, crab, crayfish, lobster, octopus, scorpion, seawasp, slug, starfish, worm

## Setup
Import the packages and load the dataset.

In [1]:
# import packages
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [2]:
# import dataset

# dataset does not initially have column names, so add them in
column_names = ['animal name', 'hair', 'feathers', 'eggs', 'milk', 
                'airborne', 'aquatic', 'predator', 'toothed', 'backbone', 
                'breathes', 'venomous', 'fins', 'legs', 'tail',
                'domestic', 'catsize', 'type']
df = pd.read_csv('zoo.csv', names = column_names)
df = df.set_index('animal name')
df.head()

animal name | hair | feathers | eggs | milk | airborne | aquatic | predator | toothed | backbone | breathes | venomous | fins | legs | tail | domestic | catsize | type
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
aardvark | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 4 | 0 | 0 | 1 | 1
antelope | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 4 | 1 | 0 | 1 | 1
bass | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 4
bear | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 4 | 0 | 0 | 1 | 1
boar | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 4 | 1 | 0 | 1 | 1
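
The loading step can be tried without the `zoo.csv` file. The sketch below is only an illustration: it feeds three of the rows shown above to `pd.read_csv` through an in-memory buffer, and the names `sample_csv` and `sample_df` are made up for this example.

```python
import io
import pandas as pd

column_names = ['animal name', 'hair', 'feathers', 'eggs', 'milk',
                'airborne', 'aquatic', 'predator', 'toothed', 'backbone',
                'breathes', 'venomous', 'fins', 'legs', 'tail',
                'domestic', 'catsize', 'type']

# three headerless rows copied from the output above (stand-in for zoo.csv)
sample_csv = io.StringIO(
    "aardvark,1,0,0,1,0,0,1,1,1,1,0,0,4,0,0,1,1\n"
    "antelope,1,0,0,1,0,0,0,1,1,1,0,0,4,1,0,1,1\n"
    "bass,0,0,1,0,0,1,1,1,1,0,0,1,0,1,0,0,4\n"
)

# names= supplies the missing header; the animal name becomes the row index
sample_df = pd.read_csv(sample_csv, names=column_names)
sample_df = sample_df.set_index('animal name')

print(sample_df.shape)                   # rows x remaining columns
print(sample_df['type'].value_counts())  # number of animals per type
```

Running `value_counts()` on the full `df['type']` column is a quick way to compare the loaded data against the per-type counts listed in the table above.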


### Split the Data into Features and Labels
Next, separate the features from the labels. We want to predict which 'type' each animal belongs to, so the labels are the 'type' column and the features are all the remaining columns.

In [3]:
features = df[['hair', 'feathers', 'eggs', 'milk', 
                'airborne', 'aquatic', 'predator', 'toothed', 'backbone', 
                'breathes', 'venomous', 'fins', 'legs', 'tail',
                'domestic', 'catsize']]
labels = df['type']

### Normalize the data
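Use min-max normalization, which rescales each value x to (x - min) / (max - min), so that every feature lies in the range [0, 1] and no feature dominates the distance calculation simply because it has a larger range.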

In [4]:
# min_max_normalize returns the normalized data in lst
def min_max_normalize(lst):
    # find the minimum and maximum of lst
    minimum = min(lst)
    maximum = max(lst)
    # store the new normalized values
    normalized = []
    # loop through all the elements in lst
    for i in range(len(lst)):
        # apply min-max normalization formula to each element
        normalized.append((lst[i] - minimum)/(maximum - minimum))
    return normalized
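
As a quick check of the formula, here is a minimal vectorized sketch applied to the six possible leg counts from the column table. The helper name `min_max_normalize_vec` is hypothetical and is not used elsewhere in this notebook.

```python
import numpy as np

def min_max_normalize_vec(values):
    """Vectorized min-max normalization (illustrative helper only)."""
    values = np.asarray(values, dtype=float)
    minimum, maximum = values.min(), values.max()
    # same formula as the loop version: (x - min) / (max - min)
    return (values - minimum) / (maximum - minimum)

leg_counts = [0, 2, 4, 5, 6, 8]
print(min_max_normalize_vec(leg_counts))
```

With a minimum of 0 and a maximum of 8, the leg counts 0, 2, 4, 5, 6, 8 map to 0, 0.25, 0.5, 0.625, 0.75 and 1. Both versions divide by zero when all values are equal, so a constant column would need separate handling.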

Every column except 'legs' is boolean and already takes only the values 0 and 1, so only the 'legs' column needs to be normalized.

In [5]:
# normalize the data
# work on an explicit copy so the assignment below does not raise pandas' SettingWithCopyWarning
features = features.copy()
features['legs'] = min_max_normalize(np.array(features['legs']))
features

animal name | hair | feathers | eggs | milk | airborne | aquatic | predator | toothed | backbone | breathes | venomous | fins | legs | tail | domestic | catsize
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
aardvark | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0.50 | 0 | 0 | 1
antelope | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0.50 | 1 | 0 | 1
bass | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0.00 | 1 | 0 | 0
bear | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0.50 | 0 | 0 | 1
boar | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0.50 | 1 | 0 | 1
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
wallaby | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0.25 | 1 | 0 | 1
wasp | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0.75 | 0 | 0 | 0
wolf | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0.50 | 1 | 0 | 1
worm | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0.00 | 0 | 0 | 0


### Split the Data into Training Set and Test Set
Split the data into a training set and a test set so we can evaluate the performance of the classifier. Here we will be using 80% of the data for the training set and 20% of the data for the test set.

In [7]:
# import train_test_split to split up data
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(features, labels, 
                                                    train_size=0.8, test_size=0.2,
                                                    random_state=1)

In [8]:
# check that the size of the training set is accurate
print(len(X_train))
print(len(y_train))

80
80
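
The row counts can be reproduced without the real data. The sketch below assumes the full table has 101 rows, as the per-type counts in the dataset description suggest, and uses placeholder arrays `X` and `y` (not part of the notebook) of the same shape as the features and labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# stand-in data shaped like the zoo features: 101 rows, 16 columns (assumed size)
X = np.zeros((101, 16))
y = np.arange(101)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, test_size=0.2, random_state=1)

# compare the training and test set sizes
print(len(X_train), len(X_test))
```

Since 101 does not divide evenly, the split comes out as 80 training rows and 21 test rows, which matches the 80 printed above.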


## Implement K-Nearest Neighbors

In [9]:
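# Euclidean distance between the feature vectors of two animals
def distance(animal1, animal2):
    squared_diff = 0
    # pair values by position; unlike animal1[i], this also works for pandas rows
    for a, b in zip(animal1, animal2):
        squared_diff += (a - b) ** 2
    return squared_diff ** 0.5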

def predict(unknown, dataset, labels, k):
    distances = []
    #TODO
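
The `predict` function is left unfinished here. One possible way to complete it, offered as a sketch and not as the notebook's intended solution, is the usual K-Nearest Neighbors recipe: measure the distance from the unknown point to every training row, keep the `k` closest, and return their most common label. The name `knn_predict` and the toy rows are hypothetical.

```python
from collections import Counter

def distance(animal1, animal2):
    # Euclidean distance between two equal-length feature sequences
    return sum((a - b) ** 2 for a, b in zip(animal1, animal2)) ** 0.5

def knn_predict(unknown, dataset, labels, k):
    """Hypothetical completion of predict(): majority vote of the k nearest rows."""
    # pair each training row's distance to `unknown` with that row's label
    distances = [(distance(row, unknown), label)
                 for row, label in zip(dataset, labels)]
    # sort by distance only and keep the k closest rows
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # count the labels among the neighbors; ties get no special handling
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# made-up rows of [hair, feathers, milk]; label 1 = mammal, 2 = bird
toy_features = [[1, 0, 1], [1, 0, 1], [0, 1, 0], [0, 1, 0]]
toy_labels = [1, 1, 2, 2]
print(knn_predict([1, 0, 0], toy_features, toy_labels, k=3))
```

To apply the sketch to the zoo data, pass plain arrays so that the rows can be iterated directly, for example `knn_predict(X_test.to_numpy()[0], X_train.to_numpy(), y_train.to_numpy(), k=5)`. The value of `k` in that call is an arbitrary choice.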
    